The purpose of the present study is to show that the quartile
deviation, which is commonly, though incorrectly, called the probable
error, or PE, is not valid as a unit of measurement for educational
scales. Its defect is that it does not possess the one requirement of
a unit of measurement, namely constancy. It fluctuates from one age
to another. Hence the calculation of so-called “PE values” for test
items is of questionable value when these values are thought of as
constituting a scale to represent several ages and grades. I hope to
show that although the so-called PE scaling procedure is inadequate,
the problem of scale construction can be solved so as to attain a valid
unit of measurement for educational achievement.

In order to demonstrate the variation in the PE unit of measurement
I shall apply an absolute method of scaling to the data for one of
the best known educational scales. I have previously described this
method of absolute scaling,² and it will be used here on the data
published by Trabue³ for his language scales.

It should be stated at the outset that while I shall refer freely to
Trabue’s monograph with suggestions for improvements in scaling
technique, such references should not be taken as derogatory of
Trabue’s contribution. On the contrary, I have selected his monograph
for this study because it is unusually complete in the data presented.
His study was carried out with admirable completeness, and he utilized
the scaling technique developed by Thorndike. It is an improvement in
this technique that I wish to illustrate by the application of an
absolute scaling method to the data of Trabue’s monograph.

1 This is one of a series of articles prepared by members of the staff of the
Behavior Research Fund, Illinois Institute for Juvenile Research, Chicago.
Series B, No. 103.

The writer wishes to acknowledge his appreciation of the clerical assistance
for a part of this study made available by Dr. Herman M. Adler, director of the
Institute for Juvenile Research and of the Behavior Research Fund.

2 Thurstone, L. L.: A Method of Scaling Psychological and Educational Tests.
Journal of Educational Psychology, October, 1925.

3 Trabue, M. R.: Completion-test Language Scales. Contributions to
Education, No. 77, Teachers College, Columbia University.

Fig. 1. This is Trabue’s PE scale for his completion tests. It is copied directly
from his Fig. 17 and Table XXVIII in “Completion-test Language Scales,” pp. 54-55.
It is intended to show “relations of grade distributions to each other.” Note that the
spread of completion-test ability of college graduates (CG) is shown to be the same as
that of Grade II children. A better representation of the spread in these grades is
shown in my Fig. 6 and Table IV, obtained by an improved scaling technique which
does not make the assumption that the spread is constant in all grades.


One of the principal final results of Trabue’s calculations is
represented in his Fig. 17, which is reproduced here as my Fig. 1. In
reproducing the figure, I have separated the distributions of language
ability of the several grades on separate lines in order to facilitate
analysis. Otherwise the figures are identical.

The interpretation of the diagram is as follows: On the upper line 
is a probability curve representing the frequency distribution of lan- 
guage ability to be expected in Grade II. The curve can be drawn 
to any convenient scale and the unit of measurement is the standard 
deviation of that surface or the quartile deviation (the so-called PE). 
The sentences in the language tests are given sigma values (or PE 
values) in accordance with the proportion of Grade II children who 
fill in the sentences correctly. The more difficult sentences will, then, 
be located toward the right of the mean while the easier sentences will 
be located toward the left of that origin. 

On the second line of Fig. 1 is drawn another probability curve with 
the same standard deviation as the first. By this curve Trabue repre- 
sents the distribution of language ability for Grade III children and it 
is located slightly to the right to represent the fact that the children 
move up in ability from the second to the third grade. Hence the 
mean of the Grade III distribution is higher than that of Grade II 
when the two surfaces are considered with reference to the same base 
line. On the base line of this surface one may also locate the sentences 
and it will be noticed that since the two surfaces overlap, many of the 
sentences will be common to both the Grade II and the Grade III 
distributions. The sigma value of each sentence may also be deter- 
mined for the Grade III distribution by the proportion of Grade III 
children who fill in each sentence correctly. Since we have two inde- 
pendent sigma values for each of the overlapping sentences, one for 
the Grade II children and one for the Grade III children, we are able to 
determine the lateral displacement between the two surfaces. It is 
by this means that one may determine the inter-grade interval, as it 
is called, which is the lateral or base line distance between the means of 
the two distributions. 

At this point appears an assumption which is involved in the educa- 
tional scaling technique developed by Thorndike, an assumption which 
is rarely stated explicitly and which will appear in our analysis to be 
not valid. The assumption is that when the distributions of abilities 
of several grades are plotted on a common base line or scale, their 
spreads or standard deviations will be equal. That assumption is 
implicit in the scaling technique of Thorndike but the assumption 
is not valid. In Trabue’s monograph Fig. 17, with its supporting 
calculations (reproduced as my present Fig. 1), shows the spread in
language ability of Grade II children to be the same as that of college
graduates. Now it seems hardly likely, even without the application 
of an absolute scaling method, that the dispersion or spread in ability of 
Grade II children can be as great as that of college graduates because 
with the Grade II children there can be relatively little language ability 
to spread, whereas college graduates whose mean performance is of 
course very much higher on the scale have much more possibility for 
variation. However, this is a priori reasoning which might be incor- 
rect although it seems plausible. The absolute scaling method to 
which we can submit the data answers this question on a factual basis. 
We shall then find that the PE unit commonly used in educational 
measurement is more than twice as large for the high school seniors 
as it is for Grade II children. It can also be shown that the incon- 
sistencies in scale values which Trabue finds in his data are attributable 
to the fact that his unit of measurement, the PE, is subject to gross 
variation in magnitude. 

In order that the absolute scaling method may be clear, apart from 
its algebraic setting, let us consider first a hypothetical example in 
which the numerical relations have been simplified. Let Fig. 2 repre- 
sent two grade distributions which refer to the same educational scale 
as a base line. The surfaces are drawn on two separate base lines for 
the sake of clarity. Let the distribution A represent the lower grade 
and B the upper grade. The mean performance of B is represented 
as higher in the scale than that of A which is to be expected since B 
is the higher grade. The lettered points on the base line of A may 
represent sentences, or other test items, which have been allocated 
along the base line in accordance with the proportion of right answers 
for group A. Let these sigma values take the numerical form shown 
in Table I. It will be seen that the sentences are arbitrarily spaced 
one sigma apart. Assume that the B distribution really has a spread 
twice that of group A as indicated in Fig. 2 and that the mean
performance of grade B is equal to a performance of $+3.0\sigma$ in the
A-distribution. Then the sigma values for the same sentences as
determined by the performance of grade B are as represented in Table I.
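
The arithmetic behind such a table can be sketched in a few lines. Table I itself is not reproduced in this text, so the six item positions below are assumed for illustration; only the spacing of one sigma, the mean of +3.0 and the doubled spread come from the example. The function name is ours.

```python
# Hypothetical item positions on grade A's own scale (origin at A's mean,
# unit = A's standard deviation). Assumed values, spaced one sigma apart.
x_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

MEAN_B_ON_A = 3.0   # mean of grade B, in sigma units of A (from the example)
SPREAD_RATIO = 2.0  # dispersion of B divided by dispersion of A (from the example)


def sigma_value_in_b(x_in_a: float) -> float:
    """Deviation of an item from B's mean, expressed in B's own sigma."""
    return (x_in_a - MEAN_B_ON_A) / SPREAD_RATIO


x_b = [sigma_value_in_b(x) for x in x_a]

for a, b in zip(x_a, x_b):
    print(f"sigma value in A: {a:+.1f}    sigma value in B: {b:+.1f}")
```

Items one sigma apart in grade A come out only half a sigma apart in grade B, which is the effect the hypothetical example is meant to show.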

Thorndike’s scaling method consists in determining first the scale 
value of each test item for each grade separately with the mean of each 
grade as an origin. The difficulty of a test item for Grade V children, 
for example, is determined by the proportion of right answers to the 
test item in that grade. The difficulty is expressed as a deviation 
from the mean of that grade. When a test item has been scaled in
several grades, the scale values so obtained will of course be different
because of the fact that they are expressed as deviations from different 
grade means as origins. He then reduces all of these measures to a 
common origin in the construction of an educational scale by adding to 
each scale value the scale value of the mean of the grade. This proce- 
dure assumes that the distributions of abilities in the several grades 
are all of the same dispersion and that they differ only in lateral dis- 
placement along the scale. The procedure is well illustrated in Fig. 1 
or in Trabue’s Fig. 17.

Referring to Fig. 2 it is clear that in order to reduce the overlapping
sentences or test items to a common base line or scale it is necessary to 
make not one but two adjustments. One of these adjustments con- 
cerns the means of the several grade groups and this adjustment is 
made in the Thorndike scaling methods. The second adjustment 
which is not made by Thorndike concerns the variation in dispersion 
of the several groups when they are referred to a common scale. 
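
A small numerical sketch may make the missing second adjustment concrete. It assumes six hypothetical items (illustrative positions, not Trabue's data), a lower grade A with unit dispersion, and an upper grade B with twice the dispersion and a mean 3.0 units higher, as in the simplified example. The equal-spread estimate of the inter-grade interval used here, the mean difference of the paired sigma values, is one simple reading of the procedure described above, and the function name is ours.

```python
from statistics import mean


def equal_spread_interval(x_lower, x_upper):
    """Inter-grade interval if both grades are assumed to have equal dispersion."""
    return mean(a - b for a, b in zip(x_lower, x_upper))


# Hypothetical sigma values of six items in grade A (assumed for illustration).
x_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
# The same items seen from grade B: mean 3.0 higher, dispersion twice as large.
x_b = [(x - 3.0) / 2.0 for x in x_a]

interval = equal_spread_interval(x_a, x_b)

# Only the first adjustment is made: shift B's values by the interval.
via_b = [b + interval for b in x_b]

for a, b in zip(x_a, via_b):
    print(f"scale value from A: {a:.2f}    from B, equal-spread method: {b:.2f}")
```

The two columns disagree for every item, and the disagreement changes steadily along the scale, because the shift corrects the origin but leaves the unit of grade B uncorrected.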

Table I represents, for the fictitious and simplified example, the
scale values of the six test items in the two grades. If these paired
scale values are plotted, the result is Fig. 3. The slope of this line
is the ratio of the two dispersions, $\sigma_A/\sigma_B$. The slope of
the line in Fig. 3 is 1/2, and this is consistent with the fact that in
the hypothetical example the ratio $\sigma_A/\sigma_B = 1/2$. We are
dealing here with two different units of measurement, one for the
A-distribution and another for the B-distribution. We may construct a
common scale for the two distributions with either one of these two
units but we can not guarantee consistent results by assuming that
they are equal.
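
Why the slope is the ratio of the dispersions can be seen by writing the position of a single item on the common scale in two ways, once from each grade. The following is a sketch in notation assumed for the purpose: $M_A$, $M_B$ are the grade means on the common scale and $x_A$, $x_B$ are the sigma values of the item in the two grades.

```latex
M_A + x_A\,\sigma_A = M_B + x_B\,\sigma_B
\quad\Longrightarrow\quad
x_B = \frac{\sigma_A}{\sigma_B}\,x_A - \frac{M_B - M_A}{\sigma_B}
```

With $\sigma_B = 2\sigma_A$ and $M_B - M_A = 3\sigma_A$, as in the hypothetical example, this gives $x_B = \tfrac{1}{2}x_A - \tfrac{3}{2}$: a straight line of slope 1/2 whose intercept carries the displacement of the means.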

For the purpose of absolute scaling I have used the data of Trabue’s 
Table XXXIV, page 62. The detail of the new method has been 
described in a previous article.¹ Trabue’s table was converted into
a table of proportions and this in turn was re-stated in terms of sigma
values for each grade separately.

The method consists in scaling each adjacent pair of grades. From
the calculations one obtains two facts, the displacement of the two 
means, and the ratio of the two dispersions. In Table II an example of 
the calculations is reproduced for Grades VII and VIII. The first 
column shows the sentence number which corresponds to the number- 
ing in Trabue’s table. Columns 2 and 3 show the sigma values of the 
sentences for Grade VII, the tabulation having been separated into 
two columns for the positive and negative sigma values. Columns 4 
and 5 show similarly the sigma values for the overlapping test sentences 
in Grade VIII. Only those sentences were here recorded which had 
proportions of right answers in both of these grades of more than .10
and less than .90. The elimination of the extreme proportions is
made on account of the low reliability of these proportions. Strictly
speaking, all of the sigma values should be weighted in accordance
with the corresponding proportions but this has not been done on
account of the considerable increase in arithmetical labor.

1 Thurstone: Loc. cit.

The last two columns in Table II give the squares of the sigma
values in the previous columns. Table II shows a complete arrangement
of the data for the absolute scaling of the two grades, VII and VIII,
upon the same base line. The necessary summations are also given
in Table II.
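
The construction of a table of this kind can be sketched as follows. The proportions are hypothetical, not Trabue's; the sign convention (easy items negative) follows the worked example given later for sentence 20; and the layout is simplified in that positive and negative sigma values are not split into separate columns. The function names are ours.

```python
from statistics import NormalDist


def sigma_value(p: float) -> float:
    """Deviation of an item from the grade mean; easy items come out negative."""
    return -NormalDist().inv_cdf(p)


def paired_rows(p_lower, p_upper, lo=0.10, hi=0.90):
    """Rows (item, x_lower, x_upper, x_lower^2, x_upper^2) for overlapping items.

    An item is kept only if its proportion of right answers lies strictly
    between lo and hi in both grades.
    """
    rows = []
    for item, (pl, pu) in enumerate(zip(p_lower, p_upper), start=1):
        if lo < pl < hi and lo < pu < hi:
            xl, xu = sigma_value(pl), sigma_value(pu)
            rows.append((item, xl, xu, xl * xl, xu * xu))
    return rows


# Hypothetical proportions of correct answers for five items in two grades.
p_grade_7 = [0.95, 0.80, 0.50, 0.30, 0.05]
p_grade_8 = [0.97, 0.87, 0.62, 0.41, 0.12]

rows = paired_rows(p_grade_7, p_grade_8)
# Column sums: sum x7, sum x8, sum x7^2, sum x8^2.
sums = [sum(row[i] for row in rows) for i in range(1, 5)]

for row in rows:
    print("item %d: %+.3f %+.3f %.3f %.3f" % row)
print("sums:    %+.3f %+.3f %.3f %.3f" % tuple(sums))
```

Items 1 and 5 drop out because one of their proportions falls outside the .10 to .90 band. `NormalDist.inv_cdf` raises an error for proportions of exactly 0 or 1, which the band excludes in any case.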

Table III shows a summary of all the necessary calculations for 
scaling Trabue’s data for the grade combination VII and VIII. These 
calculations refer to equations (6) and (7) in my previous article. 
The mean of the lower grade, $M_7$, and the standard deviation of that
grade group, $\sigma_7$, are obtained from similar calculations for the grade
combination VI and VII. The unit of measurement adopted for the 
scaling is the standard deviation in language test ability for Grade II, 
and the origin for the scale is arbitrarily located at the mean perform- 
ance of Grade II. It may be seen from Table III that 


$$\sigma_7 = 1.22081,$$
$$\sigma_8 = 1.32713.$$


Hence these two grade groups do not have the same dispersion in
language ability. The ratio between their dispersions is


$$\frac{\sigma_7}{\sigma_8} = .91988$$


and when it is considered that there is a progressive increase in
dispersion from one grade to the next, this variation plays havoc with
the educational scaling methods which assume that the dispersion is
constant.

The steps in the absolute scaling method can be summarized briefly 
as follows: 

1. Prepare a table showing the proportion of correct answers for 
each test item for each grade. 

2. Convert this table into another table showing the sigma value of
each test item for each grade or age group. This is done in the usual
manner with the aid of a probability table. If quartile deviations
(the so-called PE) are preferred, the table may be so arranged.

3. For each pair of adjacent grades construct a table like Table II. 
Since Trabue’s data extend from Grade II to Grade XII inclusive,
there were in this study 10 tables similar to Table II. These tables
represent respectively the grade combinations 2-3, 3-4, 4-5, and so on 
to 11-12. 

4. Make the summations indicated in Table II. 

5. Then arrange the calculations as indicated in Table III for each 
grade combination. In this study there were therefore 10 tables of 
calculations similar to Table III. The dispersion of one of the grade 
groups may be selected as a unit of measurement and its mean may be 
chosen for an origin. The standard deviation in completion test 
ability for Grade II was chosen as the unit of measurement for the 
whole scale and its mean performance was chosen as the origin, or 
zero, for the scale. It is clear that any other grade may be so used 
and the origin, or zero, may be placed at any other position, but it 
should be remembered that educational scales of this type do not have 
any real zero. The origin is necessarily in the nature of an arbitrary 
zero. 

The calculations reproduced in Table III yield two important facts; 
namely, the scale value of the mean performance of Grade VIII and 
the standard deviation of performance in that grade. These are 
respectively 

$$M_8 = 3.07644$$
and
$$\sigma_8 = 1.32713$$


The calculations should be carried to five decimal places even though 
the final figures may not be quoted to more than one or two decimals. 
This is a precaution which is advisable on account of the form of the 
algebraic expressions involved. 

In addition, Table III also shows the equation of the line of rela- 
tion for the scale values in the two grades. This equation, for Grades 
VII and VIII, is as follows: 


$$X_8 = .91988X_7 - .31722$$


in which $X_8$ represents the sigma value of a test item for Grade VIII
and $X_7$ represents the sigma value for the same test item from Grade
VII.

6. A graph is then drawn for each grade-combination like Fig. 4.
This graph shows the relation between $X_7$ and $X_8$ and the numerical
values are obtained directly from Table II. Since there are 36
sentences which give proportions of correct answers in both Grades VII
and VIII which are less than .90 and more than .10 there will be that
many points shown in Fig. 4.

7. Plot the equation of the line of relation obtained in Table III 
on the graph of Fig. 4. If the arithmetical work is correct this line 
should be a good fit for the plotted points. Figure 4 shows a good fit 
and it thereby guarantees that at least no gross arithmetical errors 
have been made. 

Fig. 4. Sigma values in Grade VIII plotted against sigma values in Grade VII.

If the plot in Fig. 4 should be distinctly non-linear, the present
scaling method is not applicable. Non-linearity here shows that the
two distributions concerned cannot both be normal on the same scale.
If the plot is linear, it proves that both distributions may be assumed
to be normal on the same scale or base line. The slope of the line
shows the ratio of the two dispersions and the intercepts show the
lateral displacement of the two means when both distributions are
plotted on the same scale.
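
The linking of one adjacent pair of grades can be sketched as below. Equations (6) and (7) of the earlier article are not reproduced in this text, so the sketch rests on an assumption: the ratio of dispersions is estimated from the ratio of the standard deviations of the paired sigma values, which is consistent with Table II carrying sums and sums of squares only. The function names are ours and the item values are hypothetical.

```python
from statistics import mean, pstdev


def link_adjacent(x_low, x_high, m_low=0.0, s_low=1.0):
    """Mean and dispersion of the upper grade on the scale of the lower grade.

    x_low and x_high are sigma values of the same items in the two grades.
    Assumes every item has one position on the common scale, so that
    m_low + x_low * s_low equals m_high + x_high * s_high apart from chance error.
    """
    s_high = s_low * pstdev(x_low) / pstdev(x_high)
    m_high = m_low + s_low * mean(x_low) - s_high * mean(x_high)
    return m_high, s_high


def line_of_relation(m_low, s_low, m_high, s_high):
    """Slope and intercept of the line giving x_high in terms of x_low."""
    return s_low / s_high, (m_low - m_high) / s_high


# Hypothetical check: six items, upper grade 3.0 higher with twice the spread.
x_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x_b = [(x - 3.0) / 2.0 for x in x_a]
m_b, s_b = link_adjacent(x_a, x_b)
slope, intercept = line_of_relation(0.0, 1.0, m_b, s_b)
print(f"mean {m_b:.3f}  dispersion {s_b:.3f}  slope {slope:.3f}  intercept {intercept:.3f}")
```

Applied in succession to the pairs II-III, III-IV and so on, with the result of each step fed in as `m_low` and `s_low` of the next, such a function carries the unit of Grade II through the whole grade range. With the dispersions quoted for Grades VII and VIII, the slope `1.22081 / 1.32713` agrees with the published .91988 to within rounding.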

The results for all the 10 grade combinations in Trabue’s data are 
shown in Table IV. Note that the dispersion for Grade VIII is about 
33 per cent larger than that of Grade II and that the dispersion of 
Grade XII is more than twice as large as that of Grade II. According
to the customary methods of constructing educational scales these
dispersions are assumed to be equal. In order to have a scale with
positive scale values for all the sentences, the origin was placed at
$-3\sigma$ for the column of scale values in Table IV.

The rate at which dispersion in the sentence completion test 
increases through the grades is shown in Fig. 5. Note that the rate of 
increase is positive. The conclusion is inevitable that children tend 
to become more unlike in the ability measured by the Trabue tests as 
they progress through the grades. Figure 5 is obtained from Table IV.

Fig. 5.


We are now ready to draw a graph showing the distributions of 
completion test abilities in all of the grades upon the same scale. This 
has been done in Fig. 6 the data for which are obtained from Table IV. 
Note that these distributions do not look like those of Fig. 1 which 
were copied directly from Trabue’s Fig. 17 and Table XXVIII. 

When we have ascertained the mean and standard deviation of
each grade-group, referred to a common base line or scale, we may
allocate each test item on that scale. Trabue has done this with
considerable completeness but he says of his tabulation: “It will be
noticed in Table XXIX that there is a tendency for each sentence to
have a higher location in the higher grades than it has in the lower
grades.” It is more than a “tendency.” In his more complete Table
XXXVI, all but one of the 56 sentences jump in scale value about three
or four PE units and some as much as 5.5 PE units, which is more than
half the range of the whole scale. He attempts an explanation for it
which is probably not correct. The reason is that the unit by which
the scaling is accomplished is itself variable. It increases with age,
as has been shown, and it should therefore not be expected that the
scale values of the items should remain constant when determined
independently for the different grades.

In his Table XXXVI, page 64, sentence 1, for example, is given a
“location above zero” of 1.15 when determined by the data for Grade
II, and 6.65 when determined by Grade XII. This conspicuous drift
in the data is avoided by the scaling method here proposed.

Our analysis corresponding to Trabue’s Table XXXVI is as follows:
For every grade, we have calculated a mean, $M$, and a dispersion,
$\sigma$, and these are listed in Table IV. The scale value of any
particular sentence is then determined by the relation


$$S_k = M_g + X_{kg}\,\sigma_g \qquad (1)$$

in which


$S_k$ = difficulty or scale value of sentence $k$.
$M_g$ = grade mean of grade $g$.
$X_{kg}$ = sigma value of observed proportion of correct answers to sentence $k$
in grade $g$.
$\sigma_g$ = dispersion of ability in grade $g$.


The application of the above equation may be seen in the following 
example: Referring to Trabue’s Table XXXIV, page 62, we find that 
1008 children in Grade VI filled in sentence 20 correctly. Since 
Trabue gives 1158 as the perfect score, he calculates the proportion of 
correct answers to be equivalent to about 87 per cent. With these 
data given, we determine the sigma value of this question to be
$-1.126\sigma_6$.

The values of $M_6$ and $\sigma_6$ are obtained from our Table IV and we
then have

$$S = 5.27918 - 1.126 \times 1.13879 = 3.997$$


This is, then, the scale value of sentence 20 as determined by the per- 
formance of Grade VI children. In a similar manner we have calcu- 
lated the scale value of each of the 56 sentences for each of the twelve 
grades. The results are shown for the first 15 sentences in Table V. 
The rest of the table can be readily reproduced from the raw data of 
Trabue’s Table XXXIV and our Table IV. 
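
The worked example for sentence 20 can be checked in a few lines. The numerical constants are those quoted in the text; the function name is ours, and the use of an inverse normal function in place of a printed probability table is an assumption of the sketch.

```python
from statistics import NormalDist


def scale_value(grade_mean: float, sigma_value: float, grade_sigma: float) -> float:
    """Equation (1): position of an item on the common scale."""
    return grade_mean + sigma_value * grade_sigma


M6, SIGMA6 = 5.27918, 1.13879   # Grade VI mean and dispersion, as quoted
X_SENTENCE_20 = -1.126          # sigma value for 87 per cent correct, as quoted

s20 = scale_value(M6, X_SENTENCE_20, SIGMA6)
print(round(s20, 3))

# The same sigma value recovered from the rounded proportion of .87.
x_from_proportion = -NormalDist().inv_cdf(0.87)
print(round(x_from_proportion, 3))
```

Repeating the call for every sentence and grade, with the means and dispersions of Table IV, would fill out the whole of Table V.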

Table V has been arranged so that the two scaling methods may 
be directly compared. At the top of the table are given three facts 
for each column namely (1) Trabue’s grade mean, (2) our grade mean, 
and (3) our measure of dispersion for each grade. The sentence 
numbers are indicated in the first left-hand column. The first line of 
each group of three entries represents Trabue’s scale value, obtained 
from his Table XXXVI. The second entry is our scale value. In the 
third line is recorded the weight that is to be assigned to our scale 
value. 

The comparison may be made horizontally. For example, Trabue 
calculated the scale values for sentence 10 in the different grades as 
follows (see my Table V or his Table XXXVI): 


SCALE VALUES FOR SENTENCE No. 10


Grade                 II   III    IV     V    VI   VII  VIII    IX     X    XI   XII
Trabue              2.79  3.23  3.86  3.98  4.21  4.53  5.05  5.22  6.32  6.62  6.83
Absolute scaling    3.74  3.86  3.93  3.84  3.67  3.53  3.65  4.00  3.85  3.61  3.42

Note that the Trabue scale values range from 2.79 to 6.83 PE units
while our scale values, determined by the method of absolute scaling, 
hover about 3.8 as a mean. 
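
The contrast between the two rows can be put in figures with a short sketch. The values are those tabulated above for sentence 10, Grades II to XII; the variable and function names are ours.

```python
# Scale values of sentence 10 in Grades II to XII, as tabulated in the text.
trabue = [2.79, 3.23, 3.86, 3.98, 4.21, 4.53, 5.05, 5.22, 6.32, 6.62, 6.83]
absolute = [3.74, 3.86, 3.93, 3.84, 3.67, 3.53, 3.65, 4.00, 3.85, 3.61, 3.42]


def value_range(values):
    """Difference between the largest and smallest scale value."""
    return max(values) - min(values)


def rises_every_grade(values):
    """True if each grade gives a higher value than the grade before it."""
    return all(a < b for a, b in zip(values, values[1:]))


print(f"Trabue:   range {value_range(trabue):.2f}, "
      f"rises in every grade: {rises_every_grade(trabue)}")
print(f"Absolute: range {value_range(absolute):.2f}, "
      f"rises in every grade: {rises_every_grade(absolute)}")
```

The first row climbs without a single reversal, which is the mark of a systematic drift; the second moves up and down within a much narrower band.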

Similar comparisons may be made for other sentences in Table V. 
The first line for each sentence shows the Trabue scale values for that 
sentence in the different grades. The second line for each sentence 
shows our scale values, determined by the method of absolute scaling. 
Note the constant increase in Trabue’s scale value for each sentence 
from grade to grade and the smaller chance fluctuations in our scale 
values. 

On the basis of a tabulation similar to Table V for all of the 56 
questions we have ascertained the number of sentences for which the 
last entry is higher than the first. I find that in 22 sentences the last 
entry happens to be higher than the first, while in 34 they happen to 
be lower than the first. The fluctuations in scale values in our data
are due primarily to variable or chance errors which are in comparison
small.

Before a mean scale value can be calculated for each sentence, 
it is necessary to note that the reliabilities of the scale values for any 
particular sentence are not the same in the several grades. The 
scale values recorded in Table V must be weighted before a mean scale 
value for each sentence can be calculated. There are three factors 
that determine the reliability of a scale value calculated from equa- 
tion (1). These are (1) the number of cases on which the original 
proportion of correct answers is calculated, (2) the actual proportion 
of correct answers, and (3) the dispersion of the grade-group. The
reliability of a proportion of a normal probability surface is as follows:¹


$$\sigma_S = \frac{\sigma}{Z}\sqrt{\frac{pq}{n}}$$


Since the weight should be inversely proportional to the square of the
reliability, we have

$$W = \frac{nZ^2}{pq\,\sigma^2} \qquad (2)$$

The value of $p$ is known from the raw data, and also the number of
cases, $n$. The factor $q = 1 - p$. The value of $Z$ is ascertained in a
probability table. In the present study the factor $nZ^2/pq$ was
calculated for values of $p$ which were recorded to two decimals only.
The value of $\sigma$ is found in Table IV. The weights so determined
constitute the third entry of each set in Table V.

1 See Kelley: “Statistical Method.” P. 90, equation 43.
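
A sketch of the weight calculation follows. It reads the weight as $nZ^2/(pq\sigma^2)$, with $Z$ taken as the ordinate of the unit normal curve at the point cutting off the proportion $p$; this reading is an interpretation of the text, and the function name and the example figures are ours.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()


def scale_value_weight(p: float, n: int, grade_sigma: float) -> float:
    """Weight of one scale value, taken as n * Z^2 / (p * q * sigma^2).

    p           proportion of correct answers (strictly between 0 and 1)
    n           number of cases on which p is based
    grade_sigma dispersion of the grade group on the common scale
    """
    q = 1.0 - p
    z = UNIT_NORMAL.pdf(UNIT_NORMAL.inv_cdf(p))  # ordinate at the cut point
    return n * z * z / (p * q * grade_sigma ** 2)


# Hypothetical figures: 100 cases in a grade of unit dispersion.
for p in (0.50, 0.87, 0.95):
    print(f"p = {p:.2f}   weight = {scale_value_weight(p, 100, 1.0):.2f}")
```

The weight is largest for proportions near .50 and falls off toward the extremes, which is in keeping with the exclusion of proportions beyond .10 and .90 in the paired scaling; it also falls as the dispersion of the grade increases.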

By means of the weights we are now able to calculate the weighted 
mean scale value for each of the 56 sentences. The weighted mean 
scale value for each sentence is 


$$m_k = \frac{W_{k.1}S_{k.1} + W_{k.2}S_{k.2} + W_{k.3}S_{k.3} + \cdots + W_{k.12}S_{k.12}}{W_{k.1} + W_{k.2} + W_{k.3} + \cdots + W_{k.12}} \qquad (3)$$

in which


$m_k$ = weighted mean scale value of sentence $k$.
$W_{k.1}$, $W_{k.2}$, $W_{k.3}$, etc. = weight assigned by equation (2) to sentence
$k$ in Grades I, II, III, etc., respectively.
$S_{k.1}$, $S_{k.2}$, $S_{k.3}$, etc. = scale value assigned by equation (1) to sentence
$k$ in Grades I, II, III, etc., respectively.


Table VI shows the weighted scale value for each of Trabue’s 56
sentences. In this table the origin is placed at $-3\sigma_2$, in which
$\sigma_2$ is the dispersion of ability in Grade II and $\sigma_2$ is
itself the unit of measurement.
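
The weighted mean of equation (3) can be sketched as follows. The scale values are those quoted earlier for sentence 10 by absolute scaling; the weights of Table V are not reproduced in this text, so equal weights are assumed here purely for illustration. The function name is ours.

```python
def weighted_mean_scale_value(scale_values, weights):
    """Weighted mean of one sentence's scale values over the grades."""
    if len(scale_values) != len(weights):
        raise ValueError("one weight is needed for each scale value")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * s for w, s in zip(weights, scale_values)) / total


# Sentence 10, Grades II to XII, by absolute scaling (values from the text).
sentence_10 = [3.74, 3.86, 3.93, 3.84, 3.67, 3.53, 3.65, 4.00, 3.85, 3.61, 3.42]
# Assumed equal weights; the published weights would replace these.
equal_weights = [1.0] * len(sentence_10)

print(round(weighted_mean_scale_value(sentence_10, equal_weights), 2))
```

With the actual weights, grades in which the sentence is answered correctly by about half the children would count most heavily toward its final scale value.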





SUMMARY 


1. The main purposes of this study have been (1) to show that the 
quartile deviation or the so-called PE is entirely unsuited as a unit in 
educational scale construction because of the fact that it is subject to 
considerable variation among the different grades and ages, and also 
(2) to offer a solution to the problem of scale construction. The 
variation in the PE unit has been shown on the data of Trabue for 
sentence completion tests because his study is one of the best known 
and best prepared scaling projects of the type here discussed. 

2. It has also been shown that inconsistencies which Trabue dis- 
cusses in his own scale values disappear by the application of an 
absolute scaling method which takes into account the fact that dis- 
persion is itself a variable among different grade and age groups. The 
detailed description of this absolute scaling method has not been 
repeated here because it has previously been described in another 
article, but examples of each type of calculation involved have been 
reproduced here to facilitate the application of the method to other 
educational scaling data. 

3. The method of absolute scaling enables us to ascertain whether 
two age or grade distributions can be represented as normal surfaces 
on the same base line even though the original raw-score distributions 
be skewed. The test consists in the linearity of the plot illustrated in 
Fig. 4.

4. The method of absolute scaling is independent of the number of
easy or difficult items in the scale. The fact that the test items are 
bunched close together in difficulty or spread apart on the scale has 
no effect on the scaling of the several age or grade groups. The fact 
that one happens to use questions that are relatively hard or easy for 
a grade group has no effect on the inter-grade interval or the inter-age 
interval in the absolute scaling method here described unless the data 
are so badly bunched as to affect the scaling on account of unreliability 
of extreme proportions. 

5. The method of absolute scaling here used is based on successive 
comparisons of adjacent age or grade-groups, such as II-III, III-IV, 
IV-V, etc. For the present, only those test items have been included 
which gave proportions of correct responses between 10 per cent and 
90 per cent in both of the adjacent groups and they have all been given 
equal weight. A more complete absolute scaling procedure will be 
described in a separate article in which all of the test items are included 
with weighting in accordance with the reliability of the proportion of 
correct answers for each test question. Each grade mean and grade 
dispersion will then be ascertained by two normal equations. In the 
complete procedure, the scale might be constructed by comparing all 
grades simultaneously instead of adjacent pairs of grades in succession. 
This procedure would be laborious for Trabue’s data in that it would 
require the solution of 20 normal equations with as many unknowns, 
namely, 10 scale values for the grade means and 10 grade dispersions. 
The labor of such a solution is almost prohibitive. Hence the com- 
promise by which adjacent grade groups are scaled in succession to 
cover, stepwise, the whole grade-range. 

The central idea involved in the method of absolute scaling is to 
provide for the varying dispersion of ability in successive ages and 
grades. Thorndike’s scaling method makes the tacit assumption 
that the dispersion is constant, an assumption which has been demon- 
strated to be not valid. The method of absolute scaling may be con- 
sidered as an improvement on Thorndike’s scaling in two regards, 
namely, (1) by providing for the varying dispersion, and (2) by pro- 
viding a rational procedure by which all of the data may be adequately 
taken into account. We have called the method absolute, not in the 
sense of measurement from an absolute origin but in the sense that 
the scale is independent of the unit selected for the raw scores and of 
the shape of the distribution of raw scores.

A TEST OF SCIENTIFIC APTITUDE¹
D. L. ZYVE
Stanford University
I


The unprecedented parallel growth of modern industry and higher 
education makes the application of scientific principles of systematiza- 
tion, organization, and selection essential to their sound development. 
The selection of satisfactory raw material for various fields of endeavor 
is equally important to both, and it is a matter of elementary wisdom 
to direct youth away from vocations or professions for which they have 
no other aptitude than a vague desire based upon extraneous considera- 
tions or the advice of a friend or parent. 

It would be highly instructive to determine the amount of waste 
resulting from such facts as that about 50 per cent of graduate 
engineers are found 15 years after graduation in fields of endeavor 
having little or nothing to do with the special training which they 
have received in schools of engineering.² How much discontent does
it represent? For intelligent guidance, according to E. P. Cubberley,
is not merely a matter of efficiency, but a matter of safety, “protecting
society from the dangers that arise when adults find themselves in
work for which they have no aptitude.”³

The true aptitudes of the thousands of incoming students must be 
determined by some objective and reliable method before they can be 
intelligently advised and guided in the direction of their aptitudes. A 
“half hour talk’’ with the student is little more than an illusion and 
from the viewpoint of economy utterly inadequate; it presupposes, 
moreover, on the part of such an adviser, an almost infinite insight and 
wisdom. 

Some of the leading engineers of today begin to realize that
admission to engineering schools “should be based on selective tests
for fitness, intelligence and character as well as for knowledge.”⁴





1 This study was suggested early in 1925 by Professor L. M. Terman, to whom 
the writer is greatly indebted for kindly encouragement as well as for clerical help 
placed at his disposal. 

2 Lyford, O. S.: The Engineer as a Leader in Industry. Bull. of the Society 
for the Promotion of Engineering Education, Vol. XIV, No. 4, Dec., 1923. 

3 Cubberley, E. P.: “Public Education in the United States.” Houghton
Mifflin, 1919, p. 21. 

4 See Lyford: Loc. cit., p. 172.

Since no one single test can embrace all the multitudinous mental
and character traits of an individual, the problem of intelligent guid- 
ance must be approached from several sides by means of objective, 
valid and reliable tests of the most important of these traits. One 
of these traits is aptitude for science or engineering. 

We say aptitude rather than specific ability, or knowledge, for 
“capacities” or “aptitudes” “are controlled by laws inherent in the
organism and are very loosely dependent upon the individual’s achieve-
ment, whereas the abilities of an individual are chiefly determined by
his experiences and achievements.”¹

For this reason the Test of Scientific Aptitude, by means of which 
we have conducted our experimental investigation, is not based upon 
information beyond the scope of the elementary school. As a matter 
of fact, wherever specific information is necessary it is furnished in the 
test. But before we turn to the discussion of the test we may ask: 
Is there such a thing as scientific aptitude? And if so, what is it? 

We shall define science as organized knowledge based upon experi- 
ment and observation. This definition, purposely narrow, will 
enable us to draw a fairly sharp line of demarcation between natural 
sciences (including engineering) on the one hand, and social sciences 
and art on the other. Such a definition, moreover, will enable us to 
analyze scientific aptitude in terms of elements of scientific method as 
best illustrated by the physical sciences. 

Science is a product of interaction of two processes: the observa-
tional and the descriptive. “In the progressive study of natural
phenomena, that is, the phenomena of the external world, the first
work is to observe and classify facts; the process of inductive generali-
zation follows, in which the laws of nature are the objects of research.”²
In other words, there are two stages in science: natural history or the
observational stage, and natural philosophy or the descriptive stage,
the main purpose of which is the formulation of the so-called “laws”
of nature. Accordingly, there are at least three types of scientists:

1. Those whose main endeavour is confined to the observational
stage; Tycho Brahe may be taken as an example of this group.
2. Those who excel primarily in the descriptive stage; Maxwell is
a striking example of this type.
3. Those whose achievements in both fields are equally important.
Darwin, Faraday, Joule, Kelvin, Pasteur and Berthelot may be taken
as examples of this third type.

1 Koffka, K.: “The Growth of the Mind.” Harcourt, Brace and Co., New
York, 1925, pp. 39-40.
2 Lord Kelvin: Introductory Lecture, 1846.

The question of scientific aptitude closely connected with scientific 
method is not a new one, nor is it a question upon which there is much 
agreement among men of science or among philosophers. We need 
not be surprised at the wide range of opinions expounded by various 
thinkers on the subject. They range from puerile exaggerations to 
statements accessible to experimental verification. So, for example,
there are some who believe that “the only qualifications required for
the study of Nature’s storybook are devotion to truth and sincerity
of spirit, and the other qualities will come to the possessor of these.”¹
In other words, moral traits rather than intellectual traits are essential
to what may be called scientific aptitude. As for the intellectual
traits, they “will come” later. To others, as for example to Buffon,
the matter is as simple; not merely aptitude, but even genius is pri-
marily and fundamentally patience, or sustained attention. As for
Jevons, he goes so far as to declare that genius in science is “a phe-
nomenon beyond the domain of the laws of nature.”² The value of
this type of analysis is at best doubtful.

We shall find much more definiteness in Bacon’s analysis of scienti- 
fic aptitude. According to him, the main constituents of scientific 
aptitude are: love of truth, passion for research, power of suspended 
judgment, accuracy of statement, courage of steadfast endurance, 
accuracy of patient observation, freedom from preconceived ideas.³
Apparently, in Bacon’s judgment, originality or creative imagination 
is not an important element, as he considers his inductive method the 
surest way to discovery. 

According to H. Davy “patience, industry and neatness in manip-
ulation; accuracy and minuteness in observing and registering the
phenomena which occur, are essential” characteristics of a scientist.
The research man in science should have “no preconceived notions of
his own.” However: “The imagination must be active and brilliant
in seeking analogies; yet entirely under the influence of the judg-
ment in applying them. The memory must be extensive and
profound.”⁴





1 Jevons, W. S.: “The Principles of Science.” Macmillan, 1887, p. 575.

2 Gregory, Sir R.: “Discovery.” Macmillan, 1924, p. 44.

3 Bacon, F.: “Advancement of Learning and Novum Organum.” Colonial
Press, New York, 1900.

4 Paris, J. A.: “Life of Sir H. Davy.” Colburn and Bentley, London, 1831.

The views of Darwin, Faraday, Berthelot, Helmholtz and many
others are not fundamentally different from those of Davy, although 
Helmholtz believed that various branches of science required special 
aptitudes. This point of opinion need not be unduly stressed. The 
differences in aptitude between a biologist, chemist, physicist or 
engineer are, beside interest, secondary in nature. They all 
undoubtedly have something in common and that something is their 
aptitude for using the scientific method of approach in various prob- 
lematic situations, characteristic of natural sciences. 

Of course, the intuitionists—and among them we find men of such 
magnitude as Pierre Duhem and Henri Poincaré—may object that 
great discoveries have often been made not by the cut and dried 
scientific method, but through sub-conscious cerebration the laws of 
which are totally unknown to us. “The scientific genius like the great
artist,” says S. Herbert, “is frequently quite unconscious of the whence
and why of his best discoveries.”¹ It may be answered that since the
laws of sub-conscious cerebration are unknown to us, we have no 
reason whatsoever to suppose that they are different from those 
underlying our conscious thinking; besides, no discovery has ever 
been made unconsciously, unless by sheer accident; in every case the 
sub-conscious process was necessarily preceded and followed by wide- 
awake observation and reasoning, analogical thinking, and experi- 
mentation—the very essence of scientific method. It is with this 
stage of scientific thinking and procedure that our study is exclusively 
concerned. 


II. THE COMPONENTS OF THE TEST OF SCIENTIFIC APTITUDE


If scientific aptitude S is a complex conglomerate of mental and 
character traits, as is probably the case, it ought to be analyzable into 
its components such as: 


$$S = A + B + C + D + E + F + G + H + \cdots$$


where A is, let us say, ability to reason (original, not routine reason- 
ing), B ability to form generalizations through the inductive method, 
C accuracy of observation, D discrimination of values in selecting and 
arranging experimental data, E power of suspended judgment, F 
patience or sustained effort, G imagination (including creative imagina- 
tion), H devotion to truth, I muscular coordination, J orderliness, K
sensory acuity, and so on. 





1 Herbert, S.: “The Unconscious Mind.” A. and C. Black, London, 1923.

Even a superficial examination of these variables suggests that for
our purpose they are not of equal importance. Some of these factors 
may be fundamental and their lack may not be compensated by train- 
ing, as, for example, a consistent deficiency in reasoning ability; others 
may be of lesser importance and subject to improvement, or to develop- 
ment through training, such as, for example, lack of caution or a 
tendency toward hasty generalization; still others may be purely 
accidental and due primarily to environment, such as habits of work 
or ways of living. 

Which of the elements should be incorporated into a scientific 
aptitude test? 

The last group is obviously beyond the scope of our study, and 
will not be taken into consideration. The elements of the second 
group, while not essential, according to some men of science, are 
significant for the purposes of differentiation, and, therefore, should be 
incorporated into our definition of scientific aptitude. As for the 
elements of the first group, they are, as we have said, fundamental, 
and form the “core” of scientific aptitude. Even reduced to its
fundamental elements the resulting compound, which we call scientific 
aptitude, may be very complex. Moreover, neither the total number 
of its fundamental constituents nor their nature may be known. 
The last difficulty, however, is not crucial, if we succeed in proving that 
a test based upon a limited number of elements may show an acceptably 
high correlation with aptitude as detected by some other method, such as 
actual performance in the field of science. 

We shall then include in the test of scientific aptitude the following 
elements: 


1. Clarity of definition, i.e., the ability of the student to differentiate better 
definitions from poorer ones, and appreciate their relative values. 

2. Suspended vs. snap judgment, i.e., the tendency of the student to draw final 
conclusions from insufficient data. 

3. Experimental bent, i.e., the tendency of the student towards experimentation. 

4. Discrimination of values in selecting and arranging experimental data. 

5. Detection of fallacies and contradictions. 

6. Reasoning, i.e., the ability to reason not only according to well-established 
rules such as may be found in certain typical mathematical problems but also, 
so far as possible, original reasoning. 

7. Accuracy of systematic observations, i.e., the ability to observe patiently and 
accurately by adopting some method of systematization. 

8. Induction, deduction, and generalization, i.e., the ability of the student to 
use given experimental data and form correct inductions, deductions and 
generalizations. 





530 The Journal of Educational Psychology 


9. Accuracy of understanding and of interpretation, i.e., the ability to grasp the 
true meaning of a given body of information and to interpret it correctly. 

10. Caution, i.e., the tendency of the student to pause to investigate before 
adopting a method of behavior. 


It is self-evident that most of these traits cannot be isolated without 
overlapping. The test elements, however, have been designed in 
such a way as to emphasize the outstanding traits, and to reduce 
overlapping to a minimum. The following samples of the Test of 
Scientific Aptitude will give an approximate idea of the nature of the 
test. 

DEFINITION (FOUR EXERCISES)


Rank the following definitions of a bow according to merit, i.e., write 1 next to 
the best definition, write 2 next to the second best, etc. The poorest definition 
will receive the rank of 4. 


——A bow is a weapon used by primitive people, either in war or for hunting
small and even large game by means of arrows.
——A bow is a piece of wood which, after having been bent into an arc, is used
for shooting arrows.
——A bow is a weapon well known in every country from time immemorial.
——A bow is a weapon made of a strip of wood or other material, the two
ends of which are connected by a cord, by means of which an arrow may be
projected.


The final form of the test includes the definitions of four terms. 
These definitions were ranked by seven judges selected from among 
the faculty members in science and engineering departments at 
Stanford University. Only those definitions were adopted upon which 
unanimity of the judges was secured. 


SUSPENDED vs. SNAP JUDGMENT (FIVE EXERCISES)


Put a check (✓) next to the correct answer to the question given
below: 


1. What is the population of this country going to be in the year 3000? 
About 150 million; about 300 million; about 500 million; over 500 million. 
If unable to tell put a check here—— 


Obviously, no correct answer can be given to this question and 
those who have the tendency of suspending their judgment when data 
are incomplete will admit their inability to answer the question. 


5. A certain government, selling land, offered it on the following terms: 
1. If the buyer is an immigrant he may pay: 
$1000 every year for 20 years.
2. If the buyer is a native born, he may pay:
$200 the first year 
$400 the second year 
$600 the third year and so on; the annual payment being increased by 
$200 each year for 20 years. 
3. If he is a war veteran he may pay: 
$1 the first year 
$2 the second year 
$4 the third year, etc.; the annual payment being doubled each year for 
16 years. 
Which of the buyers get the best terms? 
Answer here—— 
If unable to answer, check here—— 


In this case, the correct answer may be found by simple computa-
tion; yet the number of those who were tempted to make a guess was
quite considerable.
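
The "simple computation" can be sketched as follows. This is only an illustration: it assumes that "best terms" means the smallest total sum paid and that interest is ignored, neither of which the exercise states, and the helper name `total_payments` is hypothetical.

```python
def total_payments(first, years, step=0, ratio=1):
    """Sum a schedule that starts at `first`; each following year the
    payment is multiplied by `ratio` and then increased by `step`."""
    total, payment = 0, first
    for _ in range(years):
        total += payment
        payment = payment * ratio + step
    return total

# Schedules as worded in exercise 5 (interest ignored: an assumption).
totals = {
    "immigrant": total_payments(1000, 20),             # $1000 a year for 20 years
    "native born": total_payments(200, 20, step=200),  # $200, $400, $600, ...
    "war veteran": total_payments(1, 16, ratio=2),     # $1, $2, $4, ... for 16 years
}
best = min(totals, key=totals.get)
print(totals, best)
```

Under these assumptions the totals come to $20,000, $42,000 and $65,535, so the flat schedule is the cheapest even though the doubling schedule begins at one dollar.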


DISCRIMINATION OF VALUES IN SELECTING AND ARRANGING EXPERI-
MENTAL DATA (FIVE EXERCISES)


2. A physicist wanted to measure the length of a fine wire with precision; for 
this reason he measured it several times. Below are given the results of his 
measuring. 


First measuring ............ 14.63 cm.
Second measuring ........... 13.13 cm.
Third measuring ............ 13.12 cm.
Fourth measuring ........... 13.14 cm.
Fifth measuring ............ 13.13 cm.


What is the probable length of the wire? Answer here——

Obviously, the first measuring in this exercise must be disregarded by the
individual tested.
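
The expected handling of the readings can be sketched in a few lines. The rejection rule used here (drop any reading more than 0.5 cm from the median) is an illustrative assumption, not something the test prescribes, and `probable_length` is a hypothetical helper name.

```python
from statistics import mean, median

def probable_length(readings, tolerance=0.5):
    """Mean of the readings lying within `tolerance` of the median.
    The 0.5 cm tolerance is an illustrative choice, not from the test."""
    centre = median(readings)
    kept = [x for x in readings if abs(x - centre) <= tolerance]
    return round(mean(kept), 2)

readings_cm = [14.63, 13.13, 13.12, 13.14, 13.13]  # values from the exercise
print(probable_length(readings_cm))
```

With the first reading set aside, the remaining four average 13.13 cm.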


DISCRIMINATION OF VALUES 


4. You wish to find the increase in population of your home town between 
January 1, and December 31, 1924. The only data available are those given 
below. Check those facts only which are necessary to solve the problem. 


——The number of births which occurred in your town in 1924.
——The number of people who intend to leave town in 1924.
——The number of people killed by accidents in your town in 1924.
——The number of people arrived during 1924 and now living in your town.
——The number of children born in hospitals in 1924.
——The population of your town on January 1, 1924.
——The number of people in your town who died of sickness in 1924.
——The number of people in your town who died of old age in 1924.
——The number of people in your town murdered in 1924.
——The number of people moved out of your town during 1924.
——The number of children born in homes in 1924.
——The number of guests in hotels in 1924.
——The number of deaths which occurred in your town in 1924.


EXPERIMENTAL BENT (FIVE EXERCISES)


Suppose that you have plenty of leisure and the necessary means for meeting
the situations described below. Check frankly the statement which comes nearest 
to the way in which your first impulse would lead you to handle the matter. 
(If you wish to be helped by this test you must be absolutely frank.) 

5. You wish to get the lowest possible temperature from a mixture of ice and
salt, but have found contradictory statements in two books as to the accurate pro-
portion of salt and ice.


(A) Take a proportion of ice and salt that is an average of those suggested by 
the books. 


(B) Mix ice and salt in suggested proportions and check the information given 
in the books. 


(C) Call up an ice-cream factory and secure the information needed. 
(D) Remarks. 


The test of this trait has been devised so as to detect not the actual 
experimental ability due to training, but the first impulse, which is 
actually symptomatic of a “bent.” It does not matter whether the
answers to various questions of this test element are exactly applicable 
to real situations, that is, whether the individual would actually 
proceed in the way he indicated had he been placed in a corresponding 
life situation. What matters is that following his first impulse, he 
would be inclined to proceed in the way indicated by him rather than 
in any other way. Once an experimental bent is detected, the degree 
of experimental ability is but a matter of training, other things being 
equal. 


REASONING 


3. Four gears are connected (“in mesh”) as shown in the drawing. Gear A
has 240 teeth, B has 80 teeth, C has 60 teeth, and D has 10 teeth.

(A) How many turns does D make when A makes one complete turn?
Answer here——
(B) If you connect gear D to B without using C, will D rotate faster or
slower than in case A?
Answer here——
(C) If you place between C and D a fifth gear, will D rotate faster or
slower than in case A?
Answer here——

4. There is a train leaving City A every hour (at the hour) and going to City B.
At the same time another train leaves City B going to A. The journey lasts exactly
10 hours. You took a train from A. How many trains did you meet on your way
to B, counting the one that reaches A at the moment of your departure and the
one that leaves B at your arrival?


Answer here—— 
(The correct answer is 20.) 


FALLACIES AND INCONSISTENCIES 


1. The “Evening Star” correspondent writes from the City X: “A plan was
offered to the City X, located on the shore of Lake Ontario, by which it was pro- 
posed to generate at low cost electric light and power for the vicinity. The method 
consisted in digging a deep pit in the lowest part of the shore at the bottom of 
which the plant was to be located. The cost of equipment for the plant would be 
relatively low as it would consist only of generators run by turbines to which the 
lake water would lead through a large pipe.” 

At the meeting of the council various reasons were given by the members either 
for or against the project. Check (X) any of the statements you would endorse, 
and (—) those to which you would object.

(a) I am in favor of this project for the plant will be as efficient as any using a
natural waterfall.
(b) I oppose this project for the plan is impracticable.
(c) I am in favor of this project for very cheap power could be generated by
the proposed method.
(d) I am opposed to this project for such a plant would be unsanitary.

(Of course (b) is the correct answer; as for (d) it should be checked —, for, 
obviously enough, a flooded pit would be more than unsanitary.) 


Read each of the following paragraphs. If a paragraph is consistent through-
out, put an (X) in front of it; if it is not, put a (—). (You need no special informa-
tion on the topics discussed below.)

3. White light is a mixture of various rays, the “wave lengths” of which 
decrease as we proceed from red to violet. Rays of still shorter wave lengths are 
not visible to the eye. Among these, some rays, such as ultra-violet rays and 
X-rays find many applications in medicine. The X-rays, the wave length of which 
is approximately double that of the violet rays, are used in surgery. 

5. A body lighter than its volume of water will float in water. Sodium is 
lighter than its volume of water. Sodium is a metal. Metals usually sink in 
water. A chunk of metallic sodium thrown into water will float. 


INDUCTION, DEDUCTION AND GENERALIZATION (TWO EXERCISES)


A scientist on a distant planet, in a universe different from ours, was trying to 
discover the law governing the behavior of gases. He took a certain amount of 
gas which occupied exactly 100 cubic feet and with a pressure gauge found the 
pressure of the gas was 1 lb. per sq. in. He then compressed that same amount of
gas to a volume of 50 cubic feet. At that moment the pressure gauge read 4 pounds.
He then proceeded to compress the gas more and more. Below are recorded the
results of his experiment. (The temperature of the gas remained the same through-
out the experiment.)


VOLUME OF GAS        PRESSURE
100 cu. ft.          1 lb. per sq. in.
50 cu. ft.           4 lb. per sq. in.
25 cu. ft.           16 lb. per sq. in.
12.5 cu. ft.         64 lb. per sq. in.


1. Under what pressure will the gas be reduced to a volume of 6.25 cubic feet? 
2. Supposing that the gas behaves as indicated above, what formula will 
express the general law governing the behavior of gases on that planet? 
Call V the volume of the gas and P the corresponding pressure. 
Answer here——





In connection with this exercise one need not be alarmed by the 
apparent violation of the Principle of Uniformity. Under the cir- 
cumstances such violence was harmless, and, moreover, it enabled 
us to devise experimental situations the laws of which could not be 
memorized in a high school physics text book. For what we intend 
to test through this exercise is not memory, nor even intelligent 
information, but the ability for forming correct quantitative 
generalizations. 
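
The generalization asked for in this exercise can be checked numerically. The sketch below assumes that the law has the form P = k · V^n and recovers k and n from the four tabulated pairs by a least-squares fit on logarithms; the function name `fit_power_law` is hypothetical, and the fitting method is one choice among several.

```python
import math

volumes = [100.0, 50.0, 25.0, 12.5]   # cu. ft., from the exercise
pressures = [1.0, 4.0, 16.0, 64.0]    # lb. per sq. in., from the exercise

def fit_power_law(v, p):
    """Least-squares fit of log P = log k + n log V; returns (k, n)."""
    xs = [math.log(x) for x in v]
    ys = [math.log(y) for y in p]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    n = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - n * mx)
    return k, n

k, n = fit_power_law(volumes, pressures)
p_next = k * 6.25 ** n   # predicted pressure at 6.25 cu. ft.
print(round(n, 6), round(k, 3), round(p_next, 3))
```

The fit gives an exponent of -2 and a constant of 10,000, that is P · V² = 10,000, so the pressure at 6.25 cubic feet would be 256 lb. per sq. in.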


ACCURACY OF OBSERVATION 


The test consists in completing a geometric design so as to make it 
identical with another one. To those who have no acute sense of obser- 
vation and are unable to analyze systematically a complex situation 
into its elements, the exercise appears decidedly complicated. 


CAUTION AND THOROUGHNESS (EIGHT EXERCISES) 


The exercises are based upon various optical illusions, which have 
been arranged with the purpose of meeting two ends: First, to deter- 
mine whether the student is cautious enough to read the instructions 
carefully (as he is invited to do); second, whether he is thorough enough 
to carry out these instructions without being influenced by the appar- 
ent ease of the task or by faulty inductions. 


ACCURACY OF INTERPRETATION 


The correctness of interpretation is tested by a multiple choice 
set of questions based upon more or less technical material. 

In our investigation we felt justified in assuming, temporarily at 
least, that the process of creative imagination in science is generically
similar to, if not identical with, the processes involved in analogical
thinking, reasoning, forming inductions, deductions and generalizations. 

If this assumption is correct, and it is at least more rational than 
those held by the mystically inclined, then with adequately selected 
test material, we ought to be able to detect the so-called creative 
imagination¹ even on higher levels of scientific aptitude. In other
words, there ought to be an acceptably high correlation between the 
scores on the scientific aptitude test and actual aptitude as determined 
by means of a reliable criterion in order to test this assumption as 
well as to determine the validity of our test. 


WEIGHTING THE EXERCISES 


The relative weights of various exercises were determined by com- 
paring the responses of two groups of students, one composed of 50 
research students in the departments of physics, chemistry, and elec- 
trical engineering, (criterion group), the other of 121 senior and 
graduate students in the departments of English, history, languages, 
economics, and law (the non-scientific group). The comparison was 
made by means of 2 × 2-fold tables, the line of dichotomy passing
through the mean scores of the scientific group. The coefficients of 
tetrachoric correlation (multiplied by 10) were then used as corre- 
sponding weights. These weights range from two to seven. 
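
The weighting step can be illustrated with a rough sketch. The paper does not say how the tetrachoric coefficients were computed; the code below substitutes the well-known cosine-pi approximation for brevity, and the cell counts are hypothetical, chosen only to match the two group sizes (50 and 121). The function names are likewise illustrative.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation for the table
         [[a, b],
          [c, d]]
    where a and d are the concordant cells. An approximation chosen for
    brevity; the paper does not say which computing method was used."""
    if b * c == 0:
        return 1.0
    if a * d == 0:
        return -1.0
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))

def exercise_weight(a, b, c, d):
    """Weight = tetrachoric coefficient multiplied by 10, rounded."""
    return round(10 * tetrachoric_cos_pi(a, b, c, d))

# Hypothetical counts for one exercise (NOT data from the paper):
# rows = scientific (50) / non-scientific (121) group,
# columns = at or above / below the scientific group's mean score.
print(exercise_weight(30, 20, 40, 81))
```

An exercise on which the two groups split identically would receive a coefficient of zero and hence no weight, while one that separates them sharply approaches the maximum of ten.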


THE CRITERION AND THE VALIDITY OF THE TSA 


We have adopted as our criterion the scores and ranks given by 
faculty members to a group of research students in the departments of 
physics, chemistry, and electrical engineering. 

Achievements of men, in whatever endeavor, can be estimated only 
approximately. In the world at large such an estimate is often incor- 
rect and unfair and may require long years, sometimes centuries, 
before the judgment of men is corrected. In a university, however, 
where a small body of research students is working under the direct 
supervision of competent men of science, gross misjudgment on their 
part is reduced to a minimum, and their verdict, while not infallible, 
is decidedly more reliable than that of any other group of judges. 





1 While, according to Jevons “even the definition of genius is out of the ques-
tion,” physicists like Mach believe that “we shall hardly go astray, if we regard
genius as a slight deviation from the average endowment.” The latter view is
being corroborated by Terman’s experimental studies. (See Terman, L. M.:
“The Measurement of Intelligence.”)

It is then upon the degree of agreement between the estimates by
such competent judges¹ of the scientific aptitude of research students
in a given department and those based upon the results of our test 
given to those students that we intend to base the validity of our test. 
The precaution was taken to ascertain that every student was willing 
to give all the time necessary for the completion of the test; there was 
no set time limit.² As a matter of fact, there was not a single case of
defection among the members of this group, although taking the test 
was not mandatory. 

After the tests had been scored, the names of the students (without 
scores) in each department were left with two judges (faculty mem- 
bers best acquainted with the research group), and the latter were 
asked to rank their students, not according to their ability in research 
due to training, but according to the aptitude for science, due rather to 
endowment. As soon as the judges’ ranking lists were completed, 
their rankings were compared with those based upon the test scores. 
These two series of rankings were always prepared independently. 

Table I gives the results of our investigation for the criterion group:


TABLE I

Group                  | Number | Minimum | Maximum | Mean³ | Standard  | Correlation of TSA scores with
                       |        | score   | score   |       | deviation | average scores given by judges
Chemistry              | 21     | 45      | 159     | 134   | 30.2      | .77 ± .06
Physics                | 10     | 50      | 154     | 125   | 27.2      | .95 ± .02
Electrical engineering | 19     | 76      | 166     | 138   | 23.2      | .89 ± .03
Whole criterion        | 50     | 45      | 166     | 134   | 27.3      | .74 ± .04








It may be noted that the standard deviation of the criterion group is 27.3, while 
the standard deviation of an unselected group of 246 Stanford freshmen is 28.3, 
which suggests that the TSA detects aptitude rather than ability due to training; 
it also shows that the test is capable of differentiating aptitudes for science even 
among highly selected and relatively homogeneous groups. 
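
The paper does not state how the second figure printed after each correlation in Table I was obtained. Assuming it is the conventional probable error of a product-moment coefficient, PE = .6745 (1 − r²)/√N, the sketch below recomputes it for the rows whose group size is legible; the function name is hypothetical.

```python
import math

def probable_error_of_r(r, n):
    """Conventional probable error of a product-moment correlation."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# (group, number of students, reported r) as printed in Table I
rows = [("Chemistry", 21, 0.77), ("Physics", 10, 0.95), ("Whole criterion", 50, 0.74)]
for group, n, r in rows:
    print(f"{group}: {r:.2f} +/- {probable_error_of_r(r, n):.2f}")
```

Under this assumption the recomputed values round to .06, .02 and .04, in agreement with the printed figures for these three rows.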





1 The judges were the following members of the Stanford University faculty: 
Professors Harris J. Ryan, and H. H. Henline of the department of electrical 
engineering; Professors J. G. Brown and P. A. Ross of the department of physics; 
Professors R. E. Swain and G. S. Parks of the department of chemistry, to all of 
whom the writer is greatly indebted for their wholehearted cooperation. 

2 The time taken by 413 Stanford students and faculty members varied from 
one hour to three hours with a mean of one hour 50 minutes, and SD equal to 26.4 
minutes. 

3 These means tend towards 142 when the “misfits” are eliminated.

Moreover, when the TSA was given to a group of Stanford Faculty
Members (see Table II) in various departments of science and engi-
neering, and their scores were compared with those of the science fresh-
men group,¹ the standard deviation of the science faculty group was
found to be 13.0 with a maximum score of 177 and a mean of 153,
while the SD, maximum score, and the mean of the freshmen group
were 27.9, 156, and 113 respectively; the main differences between the
two groups become obvious when we compare the minimum scores
and the means. It is rather significant that the standard deviation
of the faculty group is relatively high, which may be taken as an indi-
cation that the test is capable of differentiating aptitudes among men
of even so highly picked a group.

[Figure: Distribution of T.S.A. Scores of the Criterion Group (Science & Engineering)]

The table below gives succinctly all these results:


TABLE II

Number | Group                              | Minimum | Maximum | Mean | Standard
       |                                    | score   | score   |      | deviation
246    | Unselected freshmen                | 47      | 166     | 105  | 28.3
50     | Research (criterion)               | 45      | 166     | 134  | 27.3
21     | Science faculty                    | 133     | 177²    | 153  | 13
14     | Non-science faculty                | 95      | 167     | 118  | 24
79     | Science freshmen                   | 45      | 156     | 113  | 27.9
47     | Seniors and graduates, non-science | 30      | 155     | 90   | 29.3



















The above results may then be considered as a satisfactory proof of the validity
of the TSA.





1 Those intending to major either in science or engineering. 


2 Maximum possible test score = 185.

Statistically determined, the validity of our test is:¹

$$r_{\infty\omega} = \frac{r_{12}}{\sqrt{r_{1I}}\,\sqrt{r_{2II}}} = \frac{.74}{\sqrt{.9}\,\sqrt{.90}} = .82 \pm .07$$


TSA vs. INTELLIGENCE TESTS



It is reasonable to inquire, whether the test is not duplicating the 
function of other already existing intelligence tests, such as the Thorn-
dike Intelligence Examination or the Terman Group Test of Mental
Ability. The correlation with the Terman Group Test given to 38
Stanford students ranging from freshmen to graduates and belonging
to the Terman thousand was equal to .13 ± .10. Although we may
expect a higher correlation with a larger population, the low correla- 
tion obtained is significant enough to indicate that TSA has little in 
common with the Terman Group Test. 

1 Kelley, T. L.: “Statistical Method.” F 155a, p. 204, Macmillan, 1923. In
the above formula, $r_{1I}$ is the coefficient of reliability of the TSA and $r_{2II}$ is the
average coefficient of reliability of the judges’ scores. $r_{12}$ is the coefficient of corre-
lation between the test scores of the whole criterion group and the average scores
given by the judges.

These three coefficients were determined as follows:

The reliability of the TSA was determined by giving the form A in November,
1925, to a group of 78 Stanford freshmen; of these 78 freshmen, 40 took the second
form in February, 1926. Thus determined the coefficient of reliability is:

$$r_{1I} = .90 \pm .02$$

Corrected for range corresponding to a population of 377 (from freshmen to
graduate students) it becomes:

$$r_{1I} = .93 \pm .02$$

The average reliability of the judges’ rankings and scores was determined by
correlating their rankings and scores with those given by these judges to the same
students from three to six months later. The average coefficient of reliability
of the judges’ scores was found to be:

$$r_{2II} = .90 \pm .03$$

The coefficient of correlation between the test scores and the judges’ average
scores (based upon 100 per cent scale; the individual ranked 1 was to receive a
score of 100) of the whole criterion group was found to be:

$$r_{12} = .74 \pm .04$$

The correlation with the Thorndike Intelligence Examination
given to 53 Stanford students of the Terman thousand (including the
above 38 students) was equal to .44 ± .07 or after correction for
attenuation to .50 ± .07. With a group of 324 Stanford students
including freshmen, sophomores, juniors, seniors and graduates from
various departments the correlation between TSA and Thorndike
scores was equal to .51 ± .02 or after correction for attenuation to
.58 ± .02.¹

This correlation is low enough to make us infer that the two tests 
measure to a considerable extent different attributes. To what 
extent they do so was not possible to determine, since only 10 members 
of the criterion group had taken the Thorndike Test. 
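
The correction for attenuation mentioned above can be illustrated with a short sketch. It assumes the usual Spearman form, in which the observed correlation is divided by the square root of the product of the two reliabilities; whether Kelley's f. 155a as applied here is exactly this form is an assumption, and the function name is hypothetical. With the coefficients quoted in the footnote (observed .74, reliabilities .90 and .90) the result comes to about .82.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Divide an observed correlation by the geometric mean of two reliabilities.

    Assumed form (Spearman's correction); not confirmed as the exact
    formula used in the article.
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Coefficients quoted in the footnote: r_12 = .74, r_11 = .90, r_22 = .90
corrected = correct_for_attenuation(0.74, 0.90, 0.90)
print(f"{corrected:.2f}")
```

The reliability of the Thorndike Examination is not stated in this section, so the corrected values of .50 and .58 cannot be recomputed from the figures given.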


TABLE III. CORRELATION BETWEEN TSA AND OTHER TESTS

Name of test                           Number     r            r corrected
                                       of cases
Thorndike Intelligence Examination      324       .51 ± .02    .58 ± .02
Thorndike Intelligence Examination       53       .44 ± .07    .50 ± .07
Thorndike Intelligence Examination       31       .39
Terman Group Test                        38       .13 ± .10

The group of 324 Stanford students included those of freshman to graduate
standing.

The last three groups were made up of members of the Terman Thousand Gifted
Children now at Stanford University and included students of freshman to graduate
standing.


In order to compare the performance of our scientific (criterion)
group with that of a presumably non-scientific group, the Test of
Scientific Aptitude was given to 47 senior and graduate students of the
departments of English, education, languages, law, and economics.
The coefficient of correlation (product moment) between their TSA
scores and their average scholarship scores was found to be:

$$r_{\text{TSA}\cdot\text{sch}} = .019 \pm .091 \qquad (a)$$

whereas the correlation of the scholarship scores of this group with
their corresponding Thorndike scores is:

$$r_{\text{Th}\cdot\text{sch}} = .41 \pm .07 \qquad (b)$$


This difference is decidedly significant. The Thorndike² mean
score of the non-scientific group is 82.9, which is 6.8 points above the
average of the Stanford 1924-1925 graduating class. The SD of the
Thorndike scores for this group is 16.98, and the range is from 47
to 117. The mean of the TSA scores of the non-scientific group¹ is
86.9, i.e., about 1.6 sigmas below the mean² of our criterion scientific
group, while the range is from 30 to 155, and the SD is equal to 29.34.

1 Extreme cases of very low Thorndike scores in cases of foreign born students
were eliminated in spite of the fact that their TSA scores, as well as scholarship,
were relatively high. Had these cases been kept, the correlation would have been
considerably lower.

2 Only 31 out of 47 had taken the Thorndike Examination.


TABLE IV. COMPARISON OF THORNDIKE AND TSA SCORES OF THE NON-SCIENTIFIC GROUP

Test         Number     Minimum   Maximum   Mean    SD      Correlation
             of cases   score     score                     with scholarship
Thorndike      31         47       117      82.9    16.98   .41 ± .07
TSA            47         30       155      86.9    29.34   .02 ± .09














On the other hand the correlation between the TSA score and the
relative scholarship of the scientific group (N = 50) is

$$r_{\text{TSA}\cdot\text{sch}} = .51 \pm .07 \qquad (c)$$

However, when the TSA scores are correlated with the average scores
given by the judges to the members of the scientific group on the basis
of the student’s aptitude for science, we find

$$r_{12} = .74 \pm .04 \qquad (d)$$


1 In the English Department the highest score was secured by a girl, although
several men in the same group had more thorough training in science, one being a
former physics teacher.

2 The mean of the criterion group not excluding the “misfits” (lowest 10 per
cent of the group) is 132.

It would be interesting to correlate the Thorndike scores of the
criterion group with the TSA scores as well as average scholarship
of the science (criterion) group. Unfortunately, as stated above, only
10 members (out of 50) have taken the Thorndike Examination.
We may, however, secure a fair idea of the order of the correlation
between average scholarship and Thorndike Examination scores for a
similar group from the Faculty Bulletin issued by the Registrar’s
Office at Stanford University.* We find there the following
correlations between the Thorndike scores and the relative university
scholarship average:

TABLE V

Group          Number of cases    Coefficient of correlation
Engineering         107           .27 ± .06¹   (e)
Economics           189           .53 ± .04
Law                 177           .38 ± .04

* Bull. No. 1, Dec., 1925.








1 It should be noted that the scholarship scores of the engineering group covered 
academic as well as engineering subjects extending from the freshman to the 
senior year. Had the scholarship average been based on the senior year alone, the 
correlation, according to K. M. Cowdry, Director of Personnel Research at 
Stanford, would have been sensibly lower. 


The comparison of (a) with (b) on the one hand, and (c) or (d) 
with (e) on the other, is indicative of the difference in nature of the 
two tests. 

The correlation between the Thorndike Intelligence Examination
scores and average university scholarship in non-scientific departments
is considerably higher than in science or engineering departments,
whereas the reverse is true of the Test of Scientific Aptitude.
This may be seen from the table given below:


TABLE VI

                                  Correlation       Correlation      Correlation between
Group                             between TSA       between TSA      Thorndike Examination
                                  and scholarship   and judges’      and scholarship
                                                    scores
Scientific (engineering)
  (physical sciences)             .50 ± .07         .74 ± .04        .27 ± .06
Non-scientific (law, economics,
  [illegible])                    .02 ± .09                          .53 ± .04 (econ.)
                                                                     .38 ± .04 (law)














DIAGNOSTIC AND PROGNOSTIC VALUES OF TSA


The diagnostic value of the test may perhaps be best judged both 
by cases of complete agreement and those of sharp disagreement 
between the judges and the test. 































542 The Journal of Educational Psychology 


The cases of agreement are numerous and may be detected by 
mere inspection of the tables given above. However, let us turn to a 
few cases of decided disagreement. 

Student P. E., considered by both judges C and D as having
“brilliant” aptitude for science and preparing to start on a research
problem in physics, received on the Test of Scientific Aptitude a score
of 110, which is over one sigma (= 28) below the average (M = 142) of
the Criterion Research Group. He came out especially low on the test of
“experimental bent” and on “discrimination of values in selecting
and arranging experimental data,” as well as on “accuracy of
observation” and on “caution.”

Soon afterwards this student started on a research problem under
judge D, who at the end of the quarter decided that the diagnosis of
P. E. by the Test of Scientific Aptitude was correct and that he had no
aptitude for experimental science. An interview with P. E., who
intellectually is a brilliant young man, proved to the writer that P. E.
had no “interest” in research. On the other hand P. E. seems to have
a bent for social service and, quite probably, will achieve in that field
far more than he ever would in science.

Not less significant is the case of the student E. R., who was just
starting on a research problem in chemistry. He was graded “B”¹
in aptitude for research by two members of the department, yet his
score on the test was 72, i.e., about 2.5 sigmas below the mean of the
criterion scientific group (M = 142). A careful examination of his
test showed plainly a decided lack in clarity of reasoning, as well as
inability to form inductions and generalizations in relatively simple
experimental situations. The shortcomings of E. R. in simple
situations, such as are found in the test, were so glaring that unless
the student were ill or otherwise indisposed, the probability was high
that he had little or no aptitude for science.

The matter was then taken to a third member of the department,
Judge F, who stated that “E. R. is a poor, slow reasoner” and “has
no scientific future.” The discrepancy between the judges as well as
between the judges and the test was important enough to justify
further investigation. The low grade scored by E. R. might well have
been due to some extraneous cause such as temporary indisposition,
lack of interest, and so on. An interview with E. R. proved, however,
that none of these causes contributed to the result of the test. He told
the writer that he was “all right” while taking the test and did his
best to complete it. After some hesitation E. R., who is smooth,
sociable, and bright, frankly admitted that he “always had difficulties
in forming correct inductions from experimental data” and “was
always poor in solving problems,” and that, besides, he “was not
interested in research, but was planning to go into business.” Although
such an admission is usually significant, the writer preferred to wait
until the quarter was over, during which E. R. was given an
opportunity to carry out his research problem. The result, according to
the department, was a complete failure, and E. R. left the university.

1 A = excellent; B = good; C = average; D = no research aptitude.
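
The distances "in sigmas" quoted for P. E. and E. R. can be checked with a short sketch. It takes the criterion-group mean of 142 and sigma of 28 given in the text; the function name is hypothetical.

```python
def sigmas_from_mean(score, mean, sd):
    """Signed distance of a score from the group mean, in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (score - mean) / sd


# Criterion Research Group figures quoted in the text: M = 142, sigma = 28
for student, score in (("P. E.", 110), ("E. R.", 72)):
    print(student, round(sigmas_from_mean(score, 142, 28), 2))
```

P. E.'s score of 110 falls a little over one sigma below the mean, and E. R.'s score of 72 falls 2.5 sigmas below it, in agreement with the statements in the text.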

On the other hand there are cases of disagreement in which no such
corroborating proof can be brought forth. Thus, student S. M. came
out with a rank of 1 on the test, while the average rank given him by
the judges is 9 (out of 21). While the discrepancy is not crucial, it is
quite considerable.

To account for it, the writer interviewed both S. M. and some of
S. M.’s fellow students. According to them S. M. was “out of luck”
in selecting his research problem and, in spite of his experimental
ability, could not get any appreciable results. It must be added that
he was previously considered by Professor F “the best man in the
department.” S. M. himself was somewhat depressed both by the
negative results of his research and by some other circumstances.

These cases of sharp disagreement between the estimates of the 
judges and those based upon the test are very few and, for the time 
being, it is difficult to decide which of the two estimates is the more 
reliable one. The general indications, as may be seen from the above 
discussion, seem to be in favor of the test. 

The most probable standing $Y_0$ of a student whose TSA score $X_1$
lies within the range of scores of either our criterion or freshman
groups may be estimated from the regression equation:¹

$$Y_0 - M_0 = r_{01}\,\frac{\sigma_0}{\sigma_1}\,(X_1 - M_1) \qquad (1)$$

in which $X_1$, $M_1$, $\sigma_1$ are the score, the mean and the standard
deviation of the TSA scores of our criterion group, and $Y_0$, $M_0$,
$\sigma_0$ are the score, mean and standard deviation of the judges’
average scores of the students’ scientific aptitude; $r_{01}$ is the
coefficient of correlation between the two series.
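
A minimal sketch of the regression estimate follows. The TSA mean (134.4), SD (27.4) and correlation (.74) are the criterion-group figures reported elsewhere in the article; the mean and SD of the judges' scores are not given in this section, so the values 70 and 15 below are purely hypothetical placeholders, as is the function name.

```python
def predict_standing(x1, m1, s1, m0, s0, r01):
    """Regression estimate: Y0 = M0 + r01 * (s0 / s1) * (X1 - M1)."""
    if s1 <= 0:
        raise ValueError("s1 must be positive")
    return m0 + r01 * (s0 / s1) * (x1 - m1)


M1, S1, R01 = 134.4, 27.4, 0.74   # reported for the criterion group
M0, S0 = 70.0, 15.0               # hypothetical judges' mean and SD

for tsa_score in (110, 134.4, 160):
    estimate = predict_standing(tsa_score, M1, S1, M0, S0, R01)
    print(tsa_score, round(estimate, 1))
```

A score at the TSA mean is mapped to the judges' mean, and each TSA sigma above or below it shifts the estimate by the correlation times one sigma of the judges' scores.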





1 Kelley, T. L.: “Statistical Method,” f. 91, p. 161.

For practical purposes, however, the estimate of the most probable
aptitude for science on the basis of the TSA score may be obtained by
determining the corresponding true score:¹

$$X_\infty = r_{11} X_1 + (1 - r_{11}) M_1$$

The probable error of estimate of a true score by means of a single
TSA score is:

$$PE_{\infty\cdot 1} = .6745\,\sigma_{\infty\cdot 1} = 3.86 \text{ or, say, } 4$$

We may then infer that the chances are even that a score of, say,
115 on TSA will correspond to a true score lying between 111 and 119,
while we may say with fair certitude that the true score will lie between
115 ± 3PE, i.e., between 103 and 127.
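
The arithmetic of the preceding paragraph can be sketched as follows. The band function reproduces the limits quoted in the text, which are centred on the obtained score; the true-score function applies the formula above with the reported reliability (.93) and criterion-group mean (134.4). Both function names are hypothetical.

```python
def estimated_true_score(x1, r11, m1):
    """Weighted average of the obtained score and the group mean."""
    return r11 * x1 + (1.0 - r11) * m1


def pe_band(score, pe, k=1):
    """Limits lying k probable errors on either side of a score."""
    return (score - k * pe, score + k * pe)


print(pe_band(115, 4))        # even chances
print(pe_band(115, 4, k=3))   # "fair certitude"
print(round(estimated_true_score(115, 0.93, 134.4), 1))
```

The first two lines give the limits 111 to 119 and 103 to 127 stated in the text. The last line shows that the regressed true-score estimate for an obtained score of 115 lies slightly above 115, because the score is pulled toward the group mean.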


Norms 


The study of scores obtained by various groups, such as our criterion
scientific group, the freshman group, etc., shows that there is a decided
variation from the scientific to the non-scientific groups.

While the TSA means in the departments of physics, chemistry,
and electrical engineering vary from 123.9 to 138.3, the mean² of the
whole criterion group and its SD³ are:

$$M_1 = 134.4 \pm 3.9$$
$$SD_1 = 27.4 \pm 2.7$$

The mean of the criterion group after elimination of the “misfits”
is 142.

The mean and the SD of the TSA scores of 47 seniors and graduates
of the non-scientific group are:

$$M_{ns} = 89.7 \pm 4.3$$
$$\sigma_{ns} = 29.3 \pm 3.2$$

The mean and SD of the unselected group of 324 Stanford students
are:

$$M = 105.8 \pm 1.05$$
$$SD = 28.3 \pm 1.1$$

If the norm should be placed tentatively at the lower quartile of
the scientific group, say 115, such a norm would be 0.9 SD above the
mean of our non-scientific groups.





1 Kelley, T. L.: “Statistical Method,” f. 168, p. 214, Macmillan, 1923.
2 Kelley, T. L.: “Statistical Method,” f. 29, p. 83.
3 Kelley, T. L.: “Statistical Method,” f. 32a, p. 86.

Whether it is the lower quartile rather than the mean of the
criterion group or some intermediate point that should be chosen as a
norm for the differentiation of those who show aptitude for science
from those who do not is a matter for further experimentation.
Moreover, these norms may be made to vary for the different scientific and
engineering departments according to the needs and standards of a
given institution. More extensive application of the test is necessary
before such norms may be established.


RÉSUMÉ


1. Since no single test can embrace all the complex mental and 
character traits of an individual, the problem of intelligent differentia- 
tion and guidance of incoming college students must be approached 
from several sides by means of objective reliable and valid tests of the 
most important of these traits. One of these traits is aptitude for 
science or engineering. 

2. After having analyzed scientific aptitude into its probable 
components, corresponding test elements were devised and adequate 
weights attached to them. 

3. The Test of Scientific Aptitude was given to a group of 50
research students in the departments of physics, chemistry and
electrical engineering at Stanford University, and their scores correlated
with those assigned to them independently by competent judges in each
of the departments. The correlations (product-moment) found were
.95 ± .02, .77 ± .06 and .89 ± .03 respectively.

4. The validity of the Test of Scientific Aptitude was found to be
.82 ± .07 and its reliability .93 ± .02.

5. The correlations with the Terman Group Test of Mental
Ability and with Thorndike’s Intelligence Examination range from
.14 ± .10 to .58 ± .02.

6. The test was also given to senior and graduate students in
non-scientific departments (English, history, languages, etc.) and the
correlation of TSA scores with the students’ scholarship during the senior
year is .019 ± .091, while the correlation of Thorndike’s scores with
scholarship for the same group is .41. On the other hand the
correlation between university scholarship during the senior year and the
TSA scores of a group of Stanford research students in science or
engineering is .51 ± .07, while a similar correlation between
Thorndike’s Examination scores and the average university scholarship of a
group of engineering students is .27 ± .06.

The above data corroborate the conclusion that the Test of
Scientific Aptitude is different in nature from the Thorndike
Intelligence Examination and similar intelligence tests.

7. The diagnostic and prognostic values of the TSA have been
shown both through cases of agreement and cases of disagreement with
the judges. Moreover, the comparative study of the responses of
graduate research students and faculty on the one hand, and of
unselected freshmen on the other, definitely shows that the TSA is
a test of aptitude rather than of training, and is capable of
differentiating scientific aptitude among highly selected and trained groups.


A STUDY OF THE IMPROVABILITY OF FIFTH GRADE
SCHOOL CHILDREN IN CERTAIN MENTAL
FUNCTIONS


ESTHER HURLEY DE WEERDT 
Department of Education, Beloit College 


Note 1.—This is a report in part of the findings in a study submitted 
in partial fulfillment of requirements for a Ph.D. degree at Yale Univer- 
sity in 1923. In this connection the writer desires to pay tribute to the 
late Professor J. Crosby Chapman as an inspirational teacher, able 
counselor and zealous inquirer after truth. 

Note 2.—Copies of the complete data and the tests are on file in the 
library of Yale University. The tables of correlations in this article 
give in all cases the raw Pearson coefficients. 

The whole scheme of formal education is based upon the funda- 
mental concept of improvability. The educator has always been 
interested in this capacity of the individual and has measured it in a 
more or less direct way through class achievement. As compared 
with other educational data we have, however, relatively little statis- 
tical material on the learning or improvability of children under class- 
room conditions. The reasons for this condition are possibly quite 
obvious. Any extended experimentation takes valuable time from an 
already over-crowded public school curriculum. Improvement data, 
while rapidly amassed, easily become a complicated problem to handle, 
especially because of the danger of the unexpected and uncontrollable 
elements which are likely to enter into any learning experiment. 
Epidemics of measles, whooping cough, etc., frequently break in upon 
and render useless the data of a study made over an extended period 
of time. While this study was necessarily limited by the first reason 
mentioned, it was satisfactorily completed without any unfortunate 
mishap. 

Intelligence tests are constructed to indicate how much the child
has acquired from general contact with his environment. They are
tests of improvability, the basic assumption being equal opportunity
within a given school group. The environment being reasonably
constant, the results of such tests are believed to give information by
inference concerning that quality termed native intelligence. Anyone
who has considered the evaluation of the individual in the classroom,
even in a so-called homogeneous group, knows that the constancy of
environment is a debatable question. Apart from objective
tests, we customarily judge the intelligence of the individual by his
facility in learning in a given situation, i.e., his improvability. Since
all education is directed toward the improvement of the individual, it is
quite possible that a more direct method than the intelligence test
for investigating the capacity for improvement under school
conditions may be found.

The proper classification of the child as to capacity for further 
improvement is of great social and economic consequence and of vital 
importance to the child and to the school. The habit of failure and 
the unfortunate attitude of the misfit must be eliminated by the school 
if possible. Circumstances entirely apart from native capacity such 
as poverty, poor previous instruction and insufficient motivation, may 
bring a one-time result, such as is obtained from the intelligence 
test, which is entirely in disagreement with the capacity for improve- 
ment. The suggested finality of the score on the intelligence test often 
governs all too strongly the attitude of the teacher toward the child. 

This study was undertaken in the hope of gaining some information 
of value concerning the general problem of improvement as it must be 
met under school room conditions with children as subjects. 

Subjects.—An unselected group of 49 Grade V children in an Ameri- 
can section of the city of New Haven, Conn., was chosen for this 
investigation. In the local school system Grade V seemed to be more 
nearly free from the problem of retardation that appeared in the upper 
grades. The average chronological age was ten years. Forty-five 
complete records were obtained. Illness eliminated two, and lan- 
guage difficulties the other two subjects. 

Tests and Method.—Since the school curriculum does not admit of a 
large amount of time being devoted to an investigation of this type, 
tests were chosen for the practice series which, while reasonably com- 
plex, would show improvement over the short time available. These 
tests, with the exception of cancellation, are also fairly representative 
of those found in a group intelligence test. The usual care was given 
to the exact administration of the practice tests. The children were 
taught to consider the practice period as much a part of the day’s 
work as the daily assignment in arithmetic. The daily program of 
work with the total practice time of each test as administered in the 
order listed follows: 


Practice period ................ 11 school days.
Daily period of work ........... 9:00-10:00 A. M.

TEST                                                     TOTAL TIME PER TEST

1. Substitution ........................................  44 minutes
2. Addition ............................................  55 minutes
3. Reading (Chapman-Cook speed test) ...................  16 minutes
4. Cancellation (Woodworth-Wells) ......................  33 minutes
5. Multiplication (arranged for this study) ............  55 minutes
6. Same-opposite (arranged for this study) .............  13 minutes
7. Multiplication by substitution (sheet used by
   Thorndike and ...) ..................................  44 minutes

   Total ............................................... 260 minutes

Various devices were used for motivating the group such as the 
daily addition to the approximate class graph of achievement drawn 
on the board, the daily posting of individual scores, competition within 
the group, etc. The tests were not discussed with the children and no 
specific practice was given by the teacher outside the practice hour. 

Results of Practice—The efficiency of the 45 children in these seven 
functions was measured on 11 consecutive school days. A survey of 
the measures indicates the usual irregularities between the scores of 
the first and the second day. This probably cannot be ascribed to 
practice effect, but rather to the failure to understand directions on the 
part of a few, unfamiliarity with the investigator, etc. Such cases 
cause the first day’s score, especially, to be more erratic than that of 
any other. 

The daily scores of each child were tabulated and are summarized 
in Table I by averaging the initial and final scores and determining 
the gross and percentage improvement. The score of the second day 
is used as the more reliable measure of initial ability. 


TABLE I. AVERAGE INITIAL SCORES AND GROSS AND PERCENTAGE IMPROVEMENT

                                  Initial   Gross         Percentage    Practice
Test                              score     improvement   improvement   time (minutes)
Substitution                       67.6       74.0          109           44
Addition                           12.0        4.4           37           55
Reading                             6.5       21.0          323           16
Cancellation                      121.5       69.6           62           33
Multiplication                     84.6       40.7           48           55
Same-opposite                      16.3       18.9          116           13
Multiplication by substitution     24.0       12.4           52           44

This very large specific improvement over the short period of
practice is of great interest to the educator as indicating the
possibilities in short drill periods. Even in such a well practiced function as
addition, observation of the methods of the children indicated that a
brief period of drill in number combinations would have increased
very markedly the substantial improvement noted in Table I.
Multiplication, also continually in use, shows a marked gain of 48 per cent.
When short periods of special practice can produce this amount of gain
in such specific and limited tests, some idea is given of the possibilities
of short drill periods in the more complex school fundamentals.
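
The percentage improvement in Table I appears to be the gross improvement expressed as a percentage of the initial score. A brief sketch under that assumption follows, using the table's own figures; the function name is hypothetical.

```python
def percentage_improvement(initial, gross):
    """Gross improvement as a rounded percentage of the initial score."""
    if initial <= 0:
        raise ValueError("initial score must be positive")
    return round(100.0 * gross / initial)


# (initial score, gross improvement) as read from Table I
table_i = {
    "Substitution": (67.6, 74.0),
    "Addition": (12.0, 4.4),
    "Reading": (6.5, 21.0),
    "Multiplication": (84.6, 40.7),
    "Same-opposite": (16.3, 18.9),
    "Multiplication by substitution": (24.0, 12.4),
}

for test, (initial, gross) in table_i.items():
    print(test, percentage_improvement(initial, gross))
```

These six rows reproduce the tabled percentages. The cancellation row as it reads in this copy (121.5 and 69.6) gives about 57 rather than the tabled 62, so one of its figures may be misprinted or misread.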

Using the total achievement of each child as a standard, the daily 
scores were correlated with those of the standard as a means for esti- 
mating the stability of these scores. The results are indicated in 
Table II. The coefficients for each day were summated and the aver- 
age taken for the seven tests. 
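
The procedure just described, correlating each day's scores with the total achievement, can be sketched as below. The article reports raw Pearson coefficients; the daily scores here are invented solely for illustration, and the variable and function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Raw Pearson product-moment coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2)")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a series with no variation has no correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical daily scores of four children on one test (three days shown)
daily = {
    "child_a": [10, 14, 15],
    "child_b": [20, 22, 27],
    "child_c": [12, 18, 19],
    "child_d": [25, 24, 30],
}
totals = [sum(scores) for scores in daily.values()]

for day in range(3):
    day_scores = [scores[day] for scores in daily.values()]
    print("day", day + 1, round(pearson_r(day_scores, totals), 2))
```

Because each day's score is itself part of the total, coefficients obtained this way are somewhat inflated relative to a correlation with an independent criterion.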


TABLE II. CORRELATION BETWEEN TOTAL SCORES AND DAILY SCORES IN EACH TEST

Days                              1     2     3     4     5     7     9     11
Substitution                     .60   .64   .77   .81   .91   .89   .91   .92
Addition                         .85   .78   .94   .92   .94   .95   .95   .95
Reading                          .79   .88   .93   .95   .96   .97   .97   .97
Cancellation                     .75   .88   .87   .90   .88   .89   .94   .87
Multiplication                   .82   .88   .93   .93   .94   .95   .93   .95
Same-opposite                    .64   .81   .82   .77   .82   .83   .90   .89
Multiplication by substitution   .69   .84   .88   .84   .88   .83   .88   .83
Average                          .73   .82   .88   .87   .90   .90   .93   .91
[label illegible]                .033  .031  .023  .024  .017  .020  .011  .018





The practice results of the first day as indicated in Table II should 
evidently be considered as preliminary practice or pre-drill. A 
comparison of the first day with the other days in Table II argues 
strongly for the use of the pre-drill period in any test where a stable 
score from a group at one sitting is desired. While much criticism 
has been directed against the time and material consumed in pre-drill 
in group tests, these comparative results substantiate the belief in the 
need of such pre-drill. In tests where no pre-drill is given, the scores
obtained are evidently a measure of the individual variability in
“getting set” to the problem combined with the amount actually
achieved and not of actual present ability in the test at that particular
time.

If practice tended to affect all equally or in direct proportion to the
respective initial scores, the initial score would be a reliable indicator 
of the permanent rank of the child in his own group. Table III shows 
the relation between initial and final scores in the 7 practice tests. 
The tendency for the subject to keep a certain rank evidently varies 
with the function tested. The high correlation between initial and 
final scores in reading is of interest. This test is quite analogous to 
many school functions where short-circuiting takes place through 
memorization. In general the table indicates that change takes place 
in the ranking of the individuals within the group. It is not possible 
to predict with a high degree of accuracy the relative final ranks of the 
subjects on the basis of their initial scores. 


TABLE III. CORRELATION OF INITIAL SCORES WITH IMPROVEMENT (11 + 9) − (2 + 4)
AND FINAL SCORE (11 + 9)

                                  Improvement¹   Final score
Substitution                         .26            .68
Addition                             .39            .84
Reading                              .78            .93
Cancellation                         .14            .79
Multiplication                       .01            .83
Same-opposite                       −.006           .70
Multiplication by substitution      −.22            .72

1 Improvement is measured by the difference between the sum of the scores
on days 11 and 9 and days 2 and 4 for the purpose of gaining greater stability.
Composite scores are similarly indicated in other Tables.
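
The composite measures defined in the note to Table III can be written out as a short sketch. The daily scores below are invented for illustration, and the function names are hypothetical.

```python
def composite(daily_scores, days):
    """Sum of one child's scores on the given practice days."""
    return sum(daily_scores[day] for day in days)


def improvement(daily_scores):
    """Final composite (days 11 + 9) minus initial composite (days 2 + 4)."""
    return composite(daily_scores, (11, 9)) - composite(daily_scores, (2, 4))


# Hypothetical record of one child on one test, keyed by practice day
child = {1: 30, 2: 41, 3: 44, 4: 47, 5: 52, 7: 58, 9: 63, 11: 66}

print(composite(child, (2, 4)))    # initial score
print(composite(child, (11, 9)))   # final score
print(improvement(child))
```

Summing two days at each end steadies the measure against a single erratic sitting, and the first day is left out of the initial composite for the reason given in the discussion of Table II.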


Table III also indicates that a subject scoring high initially and 
keeping his rank relatively well in his group does not necessarily make 
a comparable gain in actual units of improvement. The correlations 
between initial scores and amount of improvement differ with the 
functions tested and also differ decidedly from the correlations between 
initial scores and final scores. Weighting the final increment gained 
by individuals practiced close to their physiological limit would 
undoubtedly give high correlations but would only be another way of 
indicating what is clearly evident here. 

Because of the growing dissatisfaction with correlation as a means 
of comparison and the difficulty in the interpretation of results so 
obtained, the data were further analyzed by the use of critical sec- 
tions. By arbitrary selection of the ten scoring highest initially in 
each test and comparing them as to improvement with the ten scoring 
lowest initially, the results in Table IV were obtained. These averages
show a pronounced difference between the gross improvements of the
upper ten and the lower ten in each test. The relation between the 
average initial abilities and the average improvabilities of the two 
groups is expressed in terms of a percentage or ratio. 
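
The method of critical sections described above can be sketched as follows: rank the children by initial score, take the ten highest and the ten lowest, and average each section's initial score and improvement. The 45 records used here are invented for illustration only, and the function name is hypothetical.

```python
def critical_sections(records, k=10):
    """Average (initial, improvement) for the k highest and k lowest initial scores.

    `records` is a list of (initial_score, improvement) pairs, one per child.
    """
    if len(records) < 2 * k:
        raise ValueError("need at least 2k records so the sections do not overlap")
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)

    def averages(section):
        n = len(section)
        return (sum(r[0] for r in section) / n, sum(r[1] for r in section) / n)

    return {"highest": averages(ranked[:k]), "lowest": averages(ranked[-k:])}


# Hypothetical group of 45 children: improvement rises slowly with initial score
records = [(initial, 0.5 * initial + 10) for initial in range(20, 110, 2)]
print(critical_sections(records))
```

Comparing section averages in this way avoids summarizing the whole relation in a single coefficient, which is the reason the text gives for adopting the method.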


TABLE IV. AVERAGE TEN HIGHEST AND TEN LOWEST INITIAL SCORES WITH THEIR
AVERAGE IMPROVABILITIES IN EACH OF THE SEVEN TESTS

                                                                               Ratio,¹ 10 lowest to
                                  Average 10 highest    Average 10 lowest     10 highest (per cent)
Test                              Initial  Improvement  Initial  Improvement  Initial  Improvement
Substitution                      199.9    124.8        114.4     98.5          57        79
Addition                           38.3     10.3         17.2      2.2          45        21
Reading                           269.2    458.0        109.1    238.9          41        52
Cancellation                      307.8    130.0        169.0    101.0          55        78
Multiplication                    265.2     54.5        118.3     54.1          45        99
Same-opposite                     512.5    328.6        266.6    320.4          52        97
Multiplication by substitution     69.5     14.6         40.3     21.4          58       147
Average per cent                                                                50        82

1 The ratio is measured by the average initial scores and improvements of the
ten lowest divided by the average initial scores and by the improvements
respectively of the ten highest.
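
The ratio defined in the note can be checked against the table with a short sketch. Only rows whose figures are clearly legible in this copy are used; the function name is hypothetical.

```python
def ratio_per_cent(lowest_avg, highest_avg):
    """Average of the ten lowest as a rounded percentage of the ten highest."""
    if highest_avg <= 0:
        raise ValueError("average of the highest section must be positive")
    return round(100.0 * lowest_avg / highest_avg)


# (highest initial, highest improvement, lowest initial, lowest improvement)
table_iv = {
    "Substitution": (199.9, 124.8, 114.4, 98.5),
    "Cancellation": (307.8, 130.0, 169.0, 101.0),
    "Multiplication": (265.2, 54.5, 118.3, 54.1),
    "Multiplication by substitution": (69.5, 14.6, 40.3, 21.4),
}

for test, (hi_init, hi_imp, lo_init, lo_imp) in table_iv.items():
    print(test, ratio_per_cent(lo_init, hi_init), ratio_per_cent(lo_imp, hi_imp))
```

In every row the improvement ratio exceeds the initial ratio, which is the sense in which the lowest ten gain more, relative to the highest ten, than their starting scores alone would suggest.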


This table indicates, taking substitution as an example, that the 
ten lowest achieved initially but 57 per cent as much as the highest. 
Their improvement, however, is 79 per cent of that of the ten highest. 
The results in this table are quite in agreement with those obtained 
by Thorndike, Race, Kirby and others. There is a positive relation- 
ship between high initial ability and gross improvement but, owing 
to the law of decreasing returns, improvement expressed in gross 
measures is not in direct proportion to the respective initial scores. 
This accounts in part for the low correlations of initial scores and 
improvements generally found. It is also evident that this apparent 
positive relationship between initial ability and improvability may at 
times reverse and those scoring low improve more in actual number 
of units. If the test is simple, certain of the group may achieve their
limit almost at once while those initially low have a large opportunity
for improvement.


TABLE V. CORRELATION BETWEEN INITIAL SCORES (2 + 4)

                                  Sub.   Add.   Read.  Canc.  Mult.  S.-O.  Mult. by sub.
Substitution                      ....   .24   −.07    .46    .57    .30    .65
Addition                                 ....   .11   −.17    .54    .12    .20
Reading                                         ....  −.08    .08    .41    .03
Cancellation                                           ....   .22    .30    .52
Multiplication                                                ....   .17    .59
Same-opposite                                                        ....   .08
Multiplication by substitution                                              ....


























The initial ability of an individual represents his improvement 
from zero efficiency until the time of measurement. It is apparent 
from Table V that the initial measure of one function does not neces- 
sarily imply a like ability in another. If the measure of the improv- 
ability in a function is desired, that function must be measured 
specifically. The coefficients between initial abilities range from a —.17 
to a .65. These varied coefficients agree with the results obtained by 
McCall (’16)¹ and Brown (’11).² The initial ability measure in this
practice study is analogous to the one measure obtained in any intelli- 
gence or educational test. The degree of the value of the intelligence 
test especially, it would seem, depends upon the wide sampling of 
abilities if one cares to draw an inference as to general ability. 

Equal practice over a short period does not bring each individual
to anything like equal final ability in all the functions. Table VI 
shows but little closer relationship existing between the tests than at 
the beginning of the practice. Factors such as special interests, 
fatigue, lessened zeal, etc., tend to enter as the practice period is 
prolonged. The capacity for improvement is not the only factor meas- 
ured in the final score. 





1 McCall: Correlation of Some Psychological and Educational Measures. 
T.C., Columbia University, Contributions to Education No. 79. 

2 Brown: Some Experimental Results in the Correlation of Mental Abilities. 
British Journal of Psychology, Vol. III, p. 296.

TABLE VI.—CORRELATION BETWEEN FINAL SCORES, MEASURE OF FINAL SCORE (11 + 9)

                                   Addi-     Read-   Cancel- Multipli-     Same-   Multiplication
                                    tion       ing    lation    cation  opposite  by substitution
Substitution .................       .31       .30       .41       .46       .52              .57
Addition .....................                 .08      -.08       .72       .14              .44
Reading ......................                           .07       .10       .67              .24
Cancellation .................                                     .01       .34              .45
Multiplication ...............                                               .23              .57
Same-opposite ................                                                                .39

The improvements of the different tests were correlated to deter- 
mine the presence or absence of a common factor in an individual 
which might be termed general improvability. These coefficients 
ranged from —.09 to .54. An analysis of the learning situation leads 
to a belief that uncontrollable factors such as likes and dislikes for 
particular phases of work, momentary fatigue, moods, etc., affect the 
improvement measures. Supposed lack of specific capacity for cer- 
tain learning elements has frequently been disproved in school and in 
practical life by the use of special motivation. 

On the day preceding the practice series, the Illinois Examination I, 
Form I, was given to the grade. This same form of the examination 
was again given on the day following the practice series. All compari- 
sons of the practice data were made with the averaged results of these 
two tests because of the greater stability gained through the use of 
the two scores. The children were entirely unfamiliar with the method
of procedure since no intelligence test had previously been given the 
grade. The correlations between the first and second test results 
were: 

[label illegible] ...................... .84
[label illegible] ...................... .77


Inasmuch as the IQ is the statement of the ability shown in the 
test in proportion to the chronological age, the gross scores as measures 
of present ability were selected as the logical measures with which to 
compare the improvement scores of the practice series in Table VII. 

If the group intelligence test measures general ability or the capac- 
ity of the individual to improve from zero efficiency to the time of the 

TABLE VII.—CORRELATION OF AVERAGE GROSS SCORES (ILLINOIS TESTS 1 AND 2)

                                       Initial       Final     Improvabilities
                                        scores      scores
                                       (2 + 4)    (11 + 9)  (11 + 9) - (2 + 4)
Substitution .....................         .01         .14                 .19
Addition .........................         .19         .27                 .28
Reading ..........................         .68         .63                 .56
Cancellation .....................        -.12        -.14                -.07
Multiplication ...................         .06         .06                 .02
Same-opposite ....................         .49         .45                 .18
Multiplication by substitution ...        -.07        -.07                 .20


test, it would seem that the capacity for specific improvement over a 
short period of intensive practice should bear a decidedly positive 
relation to general ability. Both are measured by the product which 
is supposed to be indicative of the common basic factor. Table VII 
does not show the high correlation one should expect. The use of
critical sections of the group, however, gives results which are decidedly
more significant than the correlations. The ten highest IQ’s and the
ten lowest IQ’s in the Illinois Test were selected and the respective
improvements in the seven tests averaged and tabulated. The ratio
of the improvement of the ten lowest to that of the ten highest or
“brightest” is given in the percentage column in Table VIII. In gross
amounts of improvement the “brighter” group is superior to the lower
group in 6 of the 7 tests. The one exception is cancellation. This 
last is what observation has taught one to expect in a function so 
mechanical as the crossing out of figures. 


TABLE VIII.—AVERAGE ACHIEVEMENT OF TEN HIGHEST AND TEN LOWEST IQ’S
IN ILLINOIS TEST, AND THEIR IMPROVABILITY IN EACH OF THE SEVEN TESTS

                                        Improvability,      Improvability,   Ratio, lowest
                                    average 10 highest   average 10 lowest      to highest
                                    (11 + 9) - (2 + 4)  (11 + 9) - (2 + 4)      (per cent)
Substitution .....................               144.8                99.3              69
Addition .........................                10.9                 2.1              19
Reading ..........................               414.9               303.3              73
Cancellation .....................               120.9               128.8             107
Multiplication ...................                57.1                44.1              77
Same-opposite ....................               324.8               248.4              76
Multiplication by substitution ...                22.7                15.9              70
Average per cent for the seven tests .....................................              70
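
The percentage column of Table VIII can be checked from the two improvability columns. What follows is only a sketch of that check: the names `table_viii` and `ratio_per_cent` are illustrative and not part of the study, and the row labels assume the same order of tests as in the other tables.

```python
# Average improvability, (11 + 9) - (2 + 4), of the ten highest and the ten
# lowest IQ's, as printed in Table VIII. The dictionary layout is an
# illustrative choice, not part of the original study.
table_viii = {
    "Substitution": (144.8, 99.3),
    "Addition": (10.9, 2.1),
    "Reading": (414.9, 303.3),
    "Cancellation": (120.9, 128.8),
    "Multiplication": (57.1, 44.1),
    "Same-opposite": (324.8, 248.4),
    "Multiplication by substitution": (22.7, 15.9),
}


def ratio_per_cent(highest, lowest):
    """Improvement of the ten lowest as a per cent of that of the ten highest."""
    return round(100 * lowest / highest)


ratios = {test: ratio_per_cent(h, l) for test, (h, l) in table_viii.items()}
average = round(sum(ratios.values()) / len(ratios))

for test, pct in ratios.items():
    print(f"{test:32s}{pct:4d}")
print(f"{'Average per cent':32s}{average:4d}")
```

A ratio above 100, as for cancellation, means that the ten lowest gained more in gross units than the ten highest.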


Table IX, where the gross scores of the Illinois Tests were used, 
shows practically the same results as Table VIII. Both tables indicate 
a decided positive relationship between general ability as measured 
by the Illinois Test and improvability. A comparison of these results
with the correlation tables indicates the inadequacy of unweighted 
measures. The increment of work possible during a practice series 
is much larger in point of actual units to be achieved for those who 
score initially low than for those who score initially high. A very few 
individuals within a group who achieve large increments in comparison 
with their originally low initial scores or those initially high who 
achieve small increments because of being near their physiological 
limit, exert a disproportionate influence upon the results in the use 
of a correlation formula. With the additional factor of favorable or 
unfavorable attitudes toward certain tests which are like tasks previ- 
ously met, the correlation formula as a means of stating relationship 
incorporates more than the simple relationship between actual content 
of the tests involved. The appeal of possible achievement in the 
simple function of cancellation brings about an actual reversal of the 
relationship in Tables VIII and IX. It may be inferred that such a 
simple task was a real challenge to the lower ten while the “brighter”
subjects actually failed to achieve their limit because of boredom. It 
is obvious, then, that while fairly high correlations between tests are 
obtained under good motivation, the results may be radically different 
under other conditions, i.e., indifferent motivation, different type of
subjects, etc. In other words, correlations are extremely uncertain 
as to interpretation when conditions attending the test approach any 
degree of complexity. 


TABLE IX.—AVERAGE ACHIEVEMENT OF TEN HIGHEST AND TEN LOWEST IN
ILLINOIS TEST, AVERAGE GROSS SCORES, AND THEIR AVERAGE IMPROVABILITY
IN EACH OF THE SEVEN TESTS

                                        Improvability,      Improvability,   Ratio, lowest
                                    average 10 highest   average 10 lowest      to highest
                                    (11 + 9) - (2 + 4)  (11 + 9) - (2 + 4)      (per cent)
Substitution .....................               135.8               125.5              93
Addition .........................                11.1                 4.6              41
Reading ..........................               399.6               287.3              72
Cancellation .....................               114.5               138.0             120
Multiplication ...................                61.2                60.2              98
Same-opposite ....................               289.8               271.3              93
Multiplication by substitution ...                23.3                17.4              75
Average per cent for the seven tests .....................................              85


It is evident, then, that the general test of intelligence does indi- 
cate capacity for improvement but the general test does not indicate 
how much improvement we may expect in a specific function. This
at once suggests the practicability of measuring the capacity for 
improvement directly in the specific function rather than attacking it 
by inference through a general test. In addition, for example, there 
was no indication in the general test that the “brighter” pupils would
be able to make such a decided improvement over the “duller” group.
Five minutes drill a day for eleven days gave this information. A 
scanning of daily results indicates to the experimenter that any teacher 
could have arrived at exactly the same conclusion with these subjects 
at the end of the fifth day. 

Conclusion.—The complexity of even a so-called simple learning 
experiment has been more or less recognized. The factors that may 
enter to a greater or less degree are intelligence, interest, pride in accu- 
racy, determination to do well, persistence in spite of discouragement, 
attitude toward the teacher, physical conditions both subjective and 
objective, etc. The teacher desires to know what her particular group 
can achieve under her personal leadership in her room in specific 
functions in which she is interested. These results suggest that short, 
rapid-fire drills would give any teacher a direct measure of possible 
specific achievement. The results also suggest the value of the speed 
test as an actual learning device. 

If school opportunities have not been given to a child, experience 
has shown that his score on a group intelligence test is low. This 
does not indicate that he is not a fit subject for improvement, although 
a strict interpretation of his IQ might perhaps put him among those 
considered dull. If a child’s score on the test is high, it may be con- 
sidered as a measure of his ability and opportunity, even though the 
attitude toward school work in general may leave him among the 
average pupils. 

The results of this study as well as actual classroom experience 
show that other measures of the individual besides the one-time
test are needed for classification purposes. The practice test is a
dynamic test, a measure of the ability to improve under specific training.
Unusual feats or “spurts” such as may appear in a one-time test
become impossible when for successive days the individual “drive”
or persistency, the ability for consistent work, is measured. The 
emotional make-up, the ability to meet problems and cope adequately 
with them from day to day is but another criterion of intelligence. 
This is what is measured by the practice test. The results of this 
study indicate the value of the practice test as an administrative and a 
learning device. 








GUESSING IN A TRUE-FALSE TEST 
MARTIN F. FRITZ 
Kansas State Agricultural College 


Almost every teacher who has used true-false tests to any extent 
has been puzzled, no doubt, as to the validity of deducting two points 
for each error. Although this seems to be a rather severe penalty, not 
to do it is placing a premium upon guessing. Occasionally a teacher 
will be found who deducts only one point for each mistake but this is 
unfair to the strong student as will be demonstrated. Now, the 
question concerning the guessing ability of students quite naturally 
arises. It was for the purpose of shedding light on this problem that 
this investigation was undertaken. 

A rather thorough examination of answers actually given by 
students in true-false tests reveals clearly that there is a marked 
tendency to answer “true” rather than “false.” In 19 true-false
examinations given by four college instructors (211 statements were
true and 209 were false) a total of 3065 errors were made; 1975 or
64 per cent of the errors were “true” answers (incorrect, of course)
and 1090 or 36 per cent were “false” answers. There is a margin of
28 per cent between the two indicating an unmistakable tendency to
give “true” reactions rather than “false.” It is interesting to note
that this tendency held true for each of the instructors, the above being 
merely the pooled results. The tests were given in various courses in 
psychology and education. 

It might be claimed with a certain degree of justification that the 
students had learned to react positively to the instructor’s statements. 
That is to say, the instructor can lead the students to believe by the 
manner in which the statements are made, the intonation of the voice 
or by general attitude, thus injecting his own personality into the test 
and ruining its objectivity. It might also be argued that if the students
were confronted with a series of statements about which they knew 
absolutely nothing the result would be a 50:50 chance of guessing half 
correctly instead of the ratio found above under actual class- 
room conditions. 

To test these points a true-false examination of 58 statements was 
prepared. Fifty of these statements were taken from the Encyclopedia 
Medica and since they were so very technical in nature there was 
almost no chance that any student would know any of the correct 
answers. The following statements are quite typical: 
1. Sublingual fibroma is sometimes called Riga’s disease. (true)

2. Auerbach’s plexus is to be found in the upper spinal region. 
(false) 

3. Dermatolysis frequently leads to synopsia. (false) 

4. A diminution of carbon dioxide in the blood is called Acapnia. 
(true) 

5. After death, the liquid in the subarachnoidal space is soon 
absorbed. (true) 

Not only were there equal numbers of true and false statements 
but the test was so balanced that every successive group of ten 
questions would be half true and half false. Runs were avoided, three 
of a kind being the longest used. 
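
The balancing rule just described can be sketched in code. This is not the procedure actually followed in preparing the test, which is not reported; it assumes that "every successive group of ten" means the consecutive blocks 1-10, 11-20, and so on, and that only the 50 technical statements are being keyed. The function names are hypothetical.

```python
import random


def longest_run(key):
    """Length of the longest run of identical consecutive answers."""
    best = run = 1
    for prev, cur in zip(key, key[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def balanced_key(n_blocks=5, block_size=10, max_run=3, seed=None):
    """Build a true-false key in which every block is half true and half
    false and no run of identical answers is longer than max_run."""
    rng = random.Random(seed)
    key = []
    for _ in range(n_blocks):
        block = [True] * (block_size // 2) + [False] * (block_size // 2)
        while True:
            rng.shuffle(block)
            # Check the tail of the key together with the new block so that
            # runs crossing a block boundary are caught as well.
            if longest_run(key[-max_run:] + block) <= max_run:
                break
        key.extend(block)
    return key


key = balanced_key(seed=1)
print("".join("T" if answer else "F" for answer in key))
```

Each block is reshuffled until it fits, which always terminates because a block beginning with the opposite answer to the one that ended the previous block can always be found.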

If 50 statements so foreign to a student’s thinking were presented 
the latter part of the test would receive very little thought as compared 
to the first part. In order to keep up the interest, eight very simple 
statements were placed at irregular intervals throughout the test. 
It was very gratifying to note the stimulating effect of this little 
scheme. 

The confidence of the students was won by informing them that 
the test was an experiment and that under no conditions would their 
score affect the grade they would receive in the course. Participation 
in the experiment was made optional but all those taking part were 
expected to abide by these rules: 

1. Think about each statement and answer it just as you would in a
regular class examination. 

2. Answer every question. 

3. If you do not know make your best guess. 

They were informed that the purpose of the experiment would be 
revealed afterward and that they would be expected to follow the 
above rules in a sportsmanly manner. The students seemed to be 
very much interested and eager to take the examination. Observation 
of their conduct during the experiment indicated that they were really 
making an effort to carry out their part of the experiment and private 
consultation with a few individuals afterward showed that most 
students had given their reactions conscientiously. 

Two groups were chosen, one which had finished nearly a semester 
of work under the instructor and another which was coming to class 
for the first time and therefore totally unacquainted with the instruc- 
tor’s methods. The two groups were composed of 121 and 73 students, 
respectively. 
The results were quite comparable to those found under regular
class conditions. The first group, which had become accustomed to
the instructor’s style of teaching, gave a total of 6050 reactions. Out
of this number 3685 or 60.9 per cent were “true.” The second or
new group gave a total of 3650 reactions, of which 2263 or 62 per cent
were “true.” The significant thing to note is that there seems to be a
native tendency to give “true” reactions. We have assumed in the
past that when a student makes a guess it is equivalent to the tossing
of a penny but the above data should lead us to the conclusion that
our hypothetical penny has been “loaded” to give better than 60 per
cent “heads” (“true”) out of a total of nearly 10,000 tosses!
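
How far the pooled figures depart from a fair penny can be roughly gauged with the normal approximation to the binomial. This calculation is not part of the original analysis, and it treats every reaction as an independent toss, which is only approximately true since each student gave 50 reactions.

```python
import math

# Reaction counts reported for the two groups.
true_reactions = 3685 + 2263    # "true" answers, both groups pooled
total_reactions = 6050 + 3650   # all answers, both groups pooled

share_true = true_reactions / total_reactions

# Under the fair-penny hypothesis each reaction is "true" with probability 0.5.
expected = 0.5 * total_reactions
std_dev = math.sqrt(total_reactions * 0.5 * 0.5)
z = (true_reactions - expected) / std_dev

print(f"share of 'true' reactions: {share_true:.3f}")
print(f"standard deviations above a fair penny: {z:.1f}")
```

An excess of this many standard deviations is far beyond what chance tossing would be expected to produce.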

If two points are to be deducted for each error it would be well to 
make up the examination in the ratio of at least six true statements to 
four false. Pointing out to the students that they tend to answer 
“true” when they should be answering “false” will evidently not solve
the problem. The first group of students mentioned above, who were
acquainted with the instructor’s methods, were informed immediately
after the experiment that in actual practice they tended to answer
“true” rather than “false.” They were merely told of this fact but
nothing was said that they should take it into account in their next
examination. A week later in a final examination these students
gave 41 per cent incorrect “true” responses and 59 per cent incorrect
“false” responses out of a total of 489 errors, almost the reverse of
what was found before they gained the information! Deducting only
one point for each error results in a telescoping of the grade scale. 
Let us take as an example a true-false test of 100 statements. Now 
suppose that one student knows 90 of the statements and need not 
guess on them. He can guess, theoretically, half of the remaining ten 
(weighted six true to four false) and his score would be 95. Another 
student knowing 60 of the statements can guess 20 of those remaining. 
His grade would be 80. The difference in the actual amount of 
knowledge of these two students is 30 per cent but their grades differ 
by only 15 per cent! The poorer the student the greater his reward for 
guessing. 
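
The arithmetic of this example can be laid out as a short sketch. The function name and the parameter `p_correct_guess` are illustrative; the sketch assumes, as the example does, that every unknown statement is guessed and that half of the guesses succeed.

```python
def expected_score(known, total=100, penalty=1, p_correct_guess=0.5):
    """Expected score when every unknown statement is guessed.

    Each error costs `penalty` points from a perfect score of `total`.
    """
    guessed = total - known
    errors = guessed * (1 - p_correct_guess)
    return total - penalty * errors


for penalty in (1, 2):
    strong = expected_score(90, penalty=penalty)
    weak = expected_score(60, penalty=penalty)
    print(f"penalty {penalty}: strong {strong:.0f}, weak {weak:.0f}, "
          f"difference {strong - weak:.0f}")
```

With one point off per error the two students receive 95 and 80, as in the example; with two points off they receive 90 and 60, which restores the 30-point difference in what they actually know.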

Returning again to the bona fide class examinations, we find that 
reflection is a marked aid if a student can bring himself to the point 
where he will actually change his answer. Out of 320 changes, 211 
resulted in correct responses and only 110 resulted in errors. This 
means nearly a 2:1 chance in favor of a change. It is frequently 
stated that the first impression is most likely to be correct. We cannot 
tell in this case whether it is a change of the first impression or whether
the student has failed to put down his first impression and is merely 
returning to it. 

An excellent problem in research might be carried on by some 
careful worker in high school to determine whether or not the same 
results as found in this study would be secured with secondary school 
students. Also it would be well to know just how valuable the true- 
false test is as an instrument for creating ability to consider a proposi- 
tion from various angles. Certainly this mental attitude should be 
developed before the student ever reaches the college campus. 


SUMMARY 


1. There is a marked tendency on the part of students to give 
“true” reactions rather than “false.” The ratio is approximately 
62 per cent to 38 per cent, respectively. This indicates that the 
guessing of students is not equivalent to the tossing of a penny. 

2. This tendency to give “true” reactions is about the same 
whether we take actual class materials or whether we use experimental 
materials with which the students are totally unacquainted. This 
would seem to indicate a native inclination to believe. 

3. Informing the students of this tendency will probably not 
eliminate the difficulty. The above ratio was approximately reversed 
when this was tried. 

4. Changing an answer once given resulted in a 2:1 chance of
gaining a point.








A FEW NOTES ON AGE AND SEX DIFFERENCES 
IN MECHANICAL LEARNING 


V. E. FISHER


New York University 


For the purpose of this study a simple type of substitution test was 
used, wherein the individuals were required to write certain digits in 
geometrical characters arranged in horizontal rows, mixed order, on 
regular-sized sheets of paper. There were but five different charac- 
ters: An equilateral triangle appearing in two different perspectives, 
a square, a diamond, and a U-shaped figure. Only one form of the
test was used, containing 70 characters, each character appearing an 
equal number of times in each row. 

In giving the test, a large key or chart containing one each of the 
five characters (enlarged to a diameter of six inches) with the proper 
digits inserted, was employed. This key was manipulated by means 
of a stand with a string and pulley attachment. 

Ten practices were given of 30 seconds each, interspersed with rest- 
periods of the same length. A fresh test sheet was used for each 
practice and the used test sheets were placed to one side out of view 
of the reactor; also, of course, the key was exposed only during the 
practice periods. 

Data were collected from 382 individuals, distributed with respect 
to age and sex as indicated by the following table. Special care was 
taken to avoid getting a selected group relative to any one age or sex. 
All the pupils beyond the age of eight in one school were tested and 
whole classes from two other schools. These individuals were repre- 
sentative of a highly homogeneous population, all coming from the 
better residential section of a city of 125,000 population (Salt Lake 
City, Utah). No foreign children, feeble-minded children or children 
with known marked physical defects were tested. Furthermore, in 
order to minimize the effect of any possible unknown objective or 
subjective factor upon a particular age or sex-group, the children 
were taken from the school room without regard to age or sex. This 
resulted in each test-group—from 15 to 20 were tested at a time— 
being mixed both with respect to age and sex. External conditions 
were kept reasonably constant and the same examiner did all the 
testing and grading of the papers. 


DISTRIBUTION OF REACTORS

                                 NUMBER OF INDIVIDUALS
AGE                                Male    Female     Total
9                                    13        15        28
10                                   14        16        30
11                                   20        16        36
12                                   18        22        40
13                                   25        25        50
14                                   25        25        50
15                                   25        25        50
16                                   18        30        48
Adults, 20 to 30 years old           25        25        50
Total                               183       199       382


Table I gives the mean, median, first quartile and third quartile 
scores of the various age-groups, considered separately relative to sex. 
It may be seen from this table that there is a fairly uniform rate of 
improvement in the girls from the age of 9 to 13 and in the boys from 
9 to 14, at which respective ages there is a sharp loss or decline in 
efficiency. This lower degree of efficiency is characteristic of the two 
succeeding age-groups, both sexes, for all the measures taken with the 
single exception of the third quartile score for the 15-year-old boys. 
In both sexes the adult-groups are the most efficient of all. 

From among the individuals tested, 83 boys and 74 girls were 
selected at random and given the Army Alpha. Their intelligence 
scores thus obtained, were correlated with their scores in the substitu- 
tion test. Table II gives the coefficients of correlation obtained, all 
of which are so small as to be quite negligible. 
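
The product-moment coefficient named in the title of Table II is computed as in the sketch below. The two score lists are invented solely to make the sketch run and are not data from this study; the function name is likewise illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment coefficient of correlation for paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired scores for six individuals (not from the study).
substitution_scores = [150, 172, 181, 195, 210, 224]
alpha_scores = [88, 61, 97, 70, 105, 79]

print(f"r = {pearson_r(substitution_scores, alpha_scores):.2f}")
```

A coefficient near zero means that knowing an individual's standing on one test tells little about his standing on the other.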

The time at which the loss in efficiency in the substitution test 
occurs, together with the indication that it takes place about a year 
earlier in the girls than in the boys, may suggest to some the influence 
of pubescent factors. If so, then the exact nature of these factors 
must be of considerable interest to the educator and to the person 
who is dealing with adolescents since the nature of the above test is not 
wholly foreign to much of the work that is required of the individual 
in his school work. 

The investigators cited below have obtained results which are to 
an extent comparable to those given above, although the types of 
material used and the methods of giving the tests differ quite radically. 

TABLE I.—THE SCORES ARE GIVEN HERE FOR VARIOUS MEASURES, FOR THE TWO
SEXES CONSIDERED SEPARATELY, IN TERMS OF THE AVERAGE TOTAL NUMBER
OF CORRECT SUBSTITUTIONS MADE DURING THE TEN THIRTY-SECOND
PERIODS OF PRACTICE

Age                     9     10     11     12     13     14     15     16  Adults
Boys
  First quartile       95    119    161    184    179    203    181    170     223
  Median              113    139    182    201    216    230    196    199     238
  Third quartile      139    179    205    232    245    256    275    239     272
  Mean                116    144    180    206    216    232    218    218     245
Girls
  First quartile      105    112    155    177    199    179    184    190     211
  Median              115    137    178    207    247    210    205    224     250
  Third quartile      130    167    199    235    275    246    248    263     293
  Mean                108    138    180    208    239    213    212    224     246

TABLE II.—GIVING THE COEFFICIENTS OF CORRELATION OBTAINED BETWEEN THE
SUBSTITUTION TEST USED AND THE ARMY ALPHA INTELLIGENCE TEST, TAKEN
AS A WHOLE AND ALSO RELATIVE TO CERTAIN OF THE INDIVIDUAL TESTS.
THE ROMAN NUMERALS FOLLOWING THE ARMY ALPHA INDICATE THE
VARIOUS SUB-TESTS CONSIDERED. PEARSON’S PRODUCT
MOMENTS METHOD WAS EMPLOYED

                                          Coefficient of correlation
Substitution test    Army Alpha                Boys     Girls
Substitution test    Complete test              .34       .20
Substitution test    Army Alpha I               .23       .05
Substitution test    Army Alpha V               .22       .26
Substitution test    Army Alpha VI              .20       .01
Substitution test    Army Alpha VIII            .28       .16

MENTAL MEASUREMENT


Non-language Tests in Foreign Countries. R. Pintner. School and Society, 
Sept. 17, 1927, 374-376. The performance of certain groups of Belgian children 
on the Pintner Non-language Mental Tests is compared with the norms of Amer- 
ican children. 

A New Method of Weighting and Scaling Mental Tests. F. Kuhlmann. The 
Journal of Applied Psychology, June, 1927, 181-198. The method consists 
simply of using an age norm for each trial in each test in the group test battery. 
This method takes into account a pupil’s level and permits alteration of the test 
elements. The unreliability of extreme measures and of averages as norms is 
somewhat reduced. 

A Comparison of the Intelligence of Extension and College Undergraduate Students. 
Harold Ellis Jones. School and Society, Oct. 8, 1927, 469-470. At Columbia 
University the College students rate significantly higher on the Army Alpha than 
do the Extension students. 

The Results of Some Psychological Tests at Bryn Mawr. Esther Crane. School 
and Society, May 28, 1927, 640-644. From 1920 to 1924 the coefficient of cor- 
relation between academic marks and psychological tests had a central tendency 
of .32. The plan of using both entrance examinations and marks in psychological 
tests seems to be more useful for selecting students for admission to college. 

The Relation of the Intelligence of Pre-school Children to the Education of Their 
Parents. Florence L. Goodenough. School and Society, July 9, 1927, 54-56. A 
total of 213 children between 18 and 54 months of age were tested twice on the 
Kuhlmann-Binet. The correlation between score on this test and the education of
parents was .35. 

Thorndike College Entrance Test Results in a Senior High School. Gertrude 
Hildreth. Teachers College Record, June, 1927, 1035-1043. A cumulative 
study of the Thorndike Tests in relation to school marks and Stanford Binet 
IQ’s. 

Intelligence and Safety. Max S. Henig. Journal of Educational Research, 
Sept., 1927, 81-87. A direct relation exists between intelligence as revealed by 
the Army Alpha tests and liability to accidents. 


ABILITY GROUPING 


Teaching Reading through Ability Grouping. James M. Shields. The Journal 
of Educational Method, Sept.—Oct., 1927, 7-10. A suggestive adaption of meth- 
ods of teaching to groupings on both comprehension and speed of reading. 
A Comparison of Two Methods of Instruction. Norma O. Sheedenann. School
and Society, June 4, 1927, 672-674. “Individualized” method and lecture- 
conference method in elementary psychology classes were compared. That 
these two methods of instruction are of equal effectiveness in teaching elementary 
psychology seems to be warranted by the results of this experience. 

Segregation on the Basis of Ability. H.W. Miller. School and Society, July, 
1927, 84-88, 114-120. The conclusions arrived at through a study of segregating 
college students are that the bright students are especially benefited by such a 
sectioning; sectioning on the basis of past school records gives ‘‘an accurate meas- 
ure of the student’s preparation, intelligence, and inclination.” 

An Experimental Study of the Effect on Learning of Sectioning College Classes 
on the Basis of Ability. Oscar Alvin Ulrich. University of Texas, June, 1926. 
No significant advantages were found as a result of sectioning students according 
to a psychological examination where the course is elementary educational psy- 
chology conducted by the lecture method. 


CHILD PSYCHOLOGY


Some Observations of Infant Learning and Instincts. P. P. Brainard. The 
Pedagogical Seminary and Journal of Genetic Psychology, June, 1927, 231-254. 
A rather detailed genetic study of the development of an infant. 

Conversation among Children. Claire T. Zyve. Teachers College Record, 
Oct., 1927, 46-61. A study of a third grade classroom situation. The spon- 
taneous conversation of children was recorded in order to determine topics of 
interest, developmental nature of spontaneous conversation, vocabulary usages, 
and slang usages. The present study has endeavored to show the development 
of children in conversation when they are a social group under their own control, 
with subjects of their own choosing. 

Percept Content of School Children’s Minds. Ray L. Huff. The Pedagogical 
Seminary, Mar., 1927, 129-143. Eight hundred and forty children (kindergarten 
to high school) were tested for ‘‘percept content’’ by a check list of percepts, 
known and unknown. The author believes that percept content should serve as a 
factor to be considered in teaching and advancement of pupils. 


NEW-TYPE EXAMINATIONS 


Some New-type Test Forms in High School Physics. Hans C. Gordon. Oct., 
1927, 721-731. Thirty-three variations of new-type questions have been col- 
lected from the literature of the field, and illustrated by subject-matter chosen 
from the field of high school physics. 

A Modified Form of the True-false Test. Howard V. McClusky and Francis 
D. Curtis. School Science and Mathematics, Apr., 1927, 362-366. A modified 
form of the true-false statement in which pupil corrects false statements is found 
to be more time-consuming, better adapted for drill, a better power test, more 
reliable, more difficult, and better in its tendency to eliminate guessing than the 
customary true-false test. 

Improving the Objectives—Test Question. J. T. Giles. The School Review, 
April, 1927, 286-288. Suggests the use of an “eliminate the worst’’ alternative 
as preferable to the customary multiple choice. 
MISCELLANEOUS


Do Teachers’ Marks Vary as Much as Supposed? Frederick E. Bolton. Edu- 
cation, September, 1927, 23-39. The author presents data tending to show that 
teachers’ marks on arithmetic papers are much more reliable than past studies 
have demonstrated. 

College Achievement of Private and Public School Entrants. Llewellyn T. 
Spencer. School and Society, October 1, 1927, 436-438. The public school 
men in four Yale classes are superior to the private school group in intelligence 
test scores, academic grades, frequency of graduation, and freedom from resig- 
nations. The mixed group for the most part occupies an intermediate position. 
The private school men, however, surpass the public school men in entrance 
examination grades. 

A Course in the Technique of Educational Research. Percival M. Symonds. 
Teachers College Record, October, 1927, 24-30. This study was based on analysis 
of the contents of recent Teachers College Contributions to Education. A course 
and comments on educational research are included in the article. 

A Non-verbal Will-temperament Test. Richard S. Uhrbrock and June E. 
Downey. The Journal of Applied Psychology, April, 1927, 95-105. A test 
similar in instructions to the Downey Group Will-temperament Test but differ- 
ing in the nature of the tasks. Low correlations were secured between this 
non-verbal test and (1) the verbal test and (2) the National Intelligence Test. 
“Self-confidence” on the non-verbal test is not one and the same thing as 
“self-confidence” on the verbal test. 

A Study of the Scientific Interests of Dwellers in Small Towns and in the Country. 
Francis D. Curtis. Peabody Journal of Education, July, 1927, 22-34. On 
request 700 adults and children submitted 7000 questions regarding science. A 
rank order of the interests mentioned is also included. The questions and the 
different scientific interests are predominantly physical rather than biological . . . 
Studies of newspaper and magazine science are considerably at variance with the 
results . . . These rural dwellers ask a relatively smaller percentage of ques- 
tions seeking general information and a much larger percentage of technical 
questions than the city dwellers . . . The combining of data from several 
studies tends to diminish the importance of purely local interests. 


DIAGNOSTIC AND REMEDIAL METHODS IN TEACHING READING 


The Improvement of Reading, by Arthur I. Gates. New York: Mac- 
millan Co., 1927. Pp. XII + 440. 


Careful research in diagnosing and remedying the unhappy results 
of the present methods of teaching reading has resulted in this new 
book of Mr. Gates’. As a few visits to representative classrooms will 
show, many different procedures are now being employed. Every 
method which has ever been devised is in vogue today. In some 
schools the alphabet method is still being used. In others, words and 
phonograms are taught the first day. On the other hand, in the “better” 
schools, it is heresy even to mention “phonics” until the children 
can read sentences and phrases (“meaningful material”). And finally, 
in still others, children are thought literally to “burst forth” into 
reading without any teaching at all. 

This condition is rapidly becoming unjustifiable, especially in the 
light of research which has been carried on during the past 25 years. 
Within that time several groups and individuals have studied the 
problem by various methods. Some have approached it through 
scientific analysis. Others have made use of prolonged observation of 
and reflection on classroom procedure. Whatever the approach, 
the results establish certain definite, unqualified conclusions upon some 
questions and the need for reservation of judgment upon others. 

Mr. Gates is one of those who have used the scientific method. With 
the hypothesis that “until the day is reached, if ever, when methods 
of teaching reading are perfect, diagnostic and remedial techniques 
will be essential,” he conducted this careful research. It extended 
over a period of eight years and included more than 13,000 cases. 
Tests were constructed to measure important abilities and to reveal 
specific weaknesses. Remedial measures were devised to remove 
deficiencies. The “recommendations were validated by means of 
appropriate experimental and statistical analysis.” And, finally, the 
results were incorporated into a program of diagnosis and remedy 
which Mr. Gates has integrated into what he calls an “intrinsic 
method” of teaching reading. 

The program of diagnosis and remedy, with its nature and characteristics, 
is clearly set forth in the book. There is a primary grade 
diagnosis which consists of three types of tests: 1, for word recognition; 
2, for phrase and sentence reading; 3, for paragraph comprehension. 
They are fully described so that diagnosticians and teachers can use 
them. After tests have been given and the results have been 
arranged, children can be classified into diagnostic types for which 
there are specific remedial suggestions. One valuable suggestion is 
what Mr. Gates calls the “intrinsic” device. It is a genuine reading 
activity—not an extraneous, supplementary device—introducing the 
particular skill desired into the reading process. For example, instead 
of the supplementary phonetic drill on “at,” an intrinsic device is 
introduced into a reading situation as part of a comprehension problem: 

Who got most of the milk?    The cat 
                             The rat 
                             The bat, etc. 

The reading of Grade III and classes above is diagnosed by four 
tests. Here emphasis is placed upon the significance of reading for 
specific purposes. Each test, therefore, measures a different type of 
ability such as reading to get the general significance of a paragraph, 
to note details, to follow directions, and so on. Suggestions and exer- 
cises are given to remove both mild and serious deficiencies in these 
specific abilities. 

In addition to the diagnosis of the results of reading, Mr. Gates 
was also concerned with the analysis of “bodily mechanisms” and 
other specific capacities upon which reading depends. Through this 
analysis, he hopes to remedy serious difficulties and disabilities, and 
eventually to approximate the native capacities of pupils for the tasks 
required for reading. To that end, tests were devised which measure 
visual, auditory and motor functions, associative learning, nervous 
and emotional stability. Remedial measures were outlined for pupils 
who possess such disabilities. Included in this group are the dull, 
those deficient in hearing, deaf-mutes, and others. 

Mr. Gates’ work is a significant achievement in the scientific study 
of learning in one school subject. The problem of reading was 
attacked from the standpoint of separate specific abilities or weaknesses, 
and intense study was made upon the possibility of remedying each 
deficiency and developing each ability. Instruction and remedial 
measures in the form of “intrinsic” devices accomplish this purpose. 
Within this field, it is the most practical program which has yet been 
made. 

However, in my judgment, it is a scheme of devices, either for 
remedy or instruction, and not a “method” of teaching reading. 
Emphasis is placed upon skills and specific abilities with the view that 
mastery of these will bring about all the attitudes of interest and desire 
which attend the learning of reading. It is safe to say, I think, that 
no method is adequate which stresses the intellectual goals almost 
exclusively and which either sacrifices the emotional goals or expects 
them to emerge afterward. No devices, however “intrinsic,” can 
take the place of interesting, productive, initial teaching. They are 
only a part—although an important one—of a composite, eclectic 
“method” which will include procedures to arouse children to an 
interest in reading, which will surround them with interesting and diverse 
material. It will include exercise material to increase the perception 
span, wise use of phonics, exercises which will develop specific abili- 
ties such as reading to paraphrase, to get the literary style, etc. Mr. 
Gates has some of these, but not all. LOUISE C. KRUEGER. 

Browning School, New York City. 





HELPS FOR THE KINDERGARTEN-PRIMARY TEACHER 


Psychology of the Kindergarten-primary Child, by L. A. Pechstein and 
Frances Jenkins. Boston: Houghton Mifflin Co., 1927. Pp. 
XV + 281. 


A simple, non-technical, reliable, but not unusual treatment of 
some of the general facts of the psychology of the kindergarten-primary 
child is followed by a separate section dealing with specific applications 
to practical schoolroom situations. The plan of treatment is that of 
giving in Section I a general scientific orientation in which the origin, 
development, and present status of child psychology are traced against 
a background of Froebelian philosophy. Growth, inheritance, the 
elements of learning, and the three-fold intellectual, volitional, and 
emotional nature of the child are analyzed. Discussions of individual 
differences, the mental basis of classification, and some remarks upon 
the problem of moral development conclude the first section. 

The second section deals more specifically with the child at school, 
giving concrete suggestions as to school organization in matters of 
grouping, furnishings, promotion standards, the order of the day’s 
work. Children’s experiences are recommended as providing the 
best basis for enriched living and as arousing a need for the racial tools 
of language, reading and number. Questions and problems for study 
follow all of the chapters, and suggestive bibliographies are frequently 
inserted. 

On the whole, the first section of the book is superior to the second 
section, which is too brief and general in nature to offer a rich variety 
of suggestions for the teacher in need of them. The chapter on the 
fine arts seemed notably inadequate in this respect. It seems to me 
that the organization and effectiveness of the book might have been 
materially improved if the authors had collaborated instead of working 
independently. After all, the aim was “to coordinate practice with 
the underlying science,” and this cannot be adequately done for the 
young teacher by divorcing too noticeably theory and generalization 
from concrete illustration and practical application. However, the 
book should be a real contribution to teacher-training classes, and may 
well form a part of any kindergarten-primary teacher’s professional 
equipment. ANN SHUMAKER. 

Graduate Student, Teachers College. 





THE USE OF MENTAL TESTS IN THE PSYCHOLOGICAL CLINIC 


Mental Tests in Clinical Practice, by F. L. Wells. Yonkers, N. Y.: 
The World Book Co., 1927. 


As compared with the volume of published material relating to 
experimental work with mental tests, the amount of material descriptive 
of the many minutiae connected with the administration of tests 
commonly used in psychological clinics has been meagre. Dr. Wells’ 
comprehensive treatment of the subject meets the examiner’s need for 
supplementary suggestions in regard to testing which are not to be 
found in the separate test manuals. He not only presents his own 
opinion upon controversial points but summarizes the view-points of 
other leading psychologists as well. In describing the scope of the 
book, the author states that “The book deals with techniques especially 
suited to study of individuals whose general behavior is more 
or less seriously mal-adjusted.” The book is obviously designed for 
the use of students preparing to administer or to supervise the 
administration of tests in clinics. The book can, however, be read 
with profit by others who have no professional interest in tests. 

After presenting briefly his method of approach, points of 
view and opinion as to the place of tests in clinical practice, the 
author discusses general examination methods from three points 
of view: 

(1) The surroundings of the examination. (2) The attitude of the 
person examined. (3) The conduct of the examination itself. 

Many suggestions derived from the author’s experience in adminis- 
tering the Stanford-Binet tests, the Kuhlmann Scale, a battery of 
performance tests and an original arrangement of memory tests, are 
included. There is also an adequate and useful account of the free 
association test technique. 

The author anticipates the objections which conservative users 
of the Stanford-Binet tests will make to the freedom with which he 
deviates from the test directions as specified in the manual. He 
contends that in clinical practice such digressions from the test directions 
as he suggests are justifiable. Many users of tests will regard his 
departures as too radical and will fail to distinguish a “clinical” use 
of the tests from any other use. Until much further experimentation 
has been done, there is no way of knowing what test results mean 
when extreme liberties are taken in their administration. All begin- 
ners should be impressed with the fact that common sense must at all 
times be used in testing, but that extreme caution must at the same 
time be exercised to prevent the taking of undue liberties. In the 
book there should be greater stress upon the importance of recording 
and evaluating by continued experiment any unusual exceptions made 
in test administration. Dr. Wells takes a more rigid attitude toward 
the Kuhlmann tests, possibly because of the fact that the Kuhlmann 
directions allow little opportunity for exceptions to be made. 

Throughout the book the reader feels the need for further research 
work to support or refute the opinions expressed. The question as to 
whether a chronological age of 14 or 16 be used in computing adult IQ 
is discussed. The opinions of experts are valuable but need substan- 
tiation from intensive research. 

The author’s organization of a battery of memory tests is a useful 
contribution to clinical practice. However, in connection with his 
evaluation of the tests it must be noted that the number of cases 
reported is too small to admit of conclusions as to the discriminative 
capacity of differential tests of memory for various psychotic groups. 
The results found may be due to chance. The use of the median percentile 
is a questionable statistical procedure. 

The author’s chapter on free association tests is a notable contribution 
to psychological literature. There has long been a need for a clear 
and concise discussion of this type of test. 

A brief discussion of the uses of group tests is included in the book. 
The importance of using group tests in measuring the intelligence of 
adults in place of inadequate individual tests could well receive greater 
emphasis than the author gives it. 

Included in the book is a chapter on office methods which the 
author has found efficient and economical in the conduct of a clinic. 
The book is concluded with chapters relating to the practical application 
of clinical examination methods to vocational adjustment and 
general personality problems. Comments are given on the use of 
rating scales and a useful list of personality study items is included. 

Each chapter relating directly to test administration is concluded 
with case histories illustrative of the particular psychometric clinical 
devices described in the chapter. Questions and topics for discussion 
as well as brief references are to be found at the end of each chapter. 
An extensive selected reference list appears at the end of the book. 

GERTRUDE HILDRETH. 
The Lincoln School of Teachers College. 





The Relation between Early Language Habits and Early Habits of 
Conduct Control, by Ethel B. Waring. New York: Teachers 
College Contributions to Education, Teachers College, Columbia 
University, 1927. 


Education in the past has operated on the assumption that language 
stimulation is potent in the control of child behavior. The spoken 
approval of the adult, it was supposed, served to aid the child in select- 
ing and generalizing his experience. 

The author of this monograph has made a rather thorough historical 
study of the problem, followed by an experimental attack upon it. 

Nineteen children, of an average age of four years and an average IQ 
of 114, were practiced and tested by use of a great variety of motor and 
discriminative tasks. Non-language and language approval were 
alternately used with these children, who were divided into two groups. 
The author concludes that language approval has an immediate and 
carry-over effect on behavior. 

This study is very suggestive and unusual in its method of attack 
on the problem. It is intensive rather than extensive in its nature 
and the results are statistically, rather than experimentally, sound. 
The choice of very young children is fortunate, as the returns obtained 
are less complicated and more sure. Teachers might well note not 
only that language encouragement is more effective, but also the type 
of language encouragement used by the experimenter. 

The book does throw some added light on the psychology of learn- 
ing, and should serve as an impetus to further investigation. 


JAMES E. MENDENHALL. 
Lincoln School of Teachers College. 





EDUCATION AND BEHAVIOR 


Education and the Integration of Behavior, by Meredith Smith. New 
York City: Bureau of Publications, Teachers College, Columbia 
University, 1927. Pp. V + 93. 


The problem of behavior has always occupied the attention of 
educators, though in the past it was considered merely as implicit in 
the educational situation rather than to be taken explicitly into 
account. That is, behavior was considered chiefly when it took the 
form of misbehavior. 

The educational philosophy of John Dewey has re-oriented the 
entire educative process, but in nothing has it brought about such 
radical changes as in the educational attitude toward behavior. Phi- 
losophy has not, however, had the whole-hearted support of science, 
and attempts to reconcile the two have not always been either success- 
ful or fruitful. This study is remarkable in that it not only sets up a 
reconciliation but also actively integrates a fundamental viewpoint in 
modern educational philosophy with the most recent findings of science. 

“Based on recent physiological discoveries, this study formulates 
a theory of behavior which considers experience from the biological 
viewpoint of the goal of all organic life; namely mastery over the 
environment. 

“The process of approach throws into relief the progressive factors 
in the development of organic life. These factors are viewed as con- 
stituting the basis for the determination of the educative process. 
In the account given of an educational experiment, the method of 
procedure is made explicit.” 

No one who is interested in the theories or practices of modern 
progressive education can afford to overlook this study. It offers a 
real contribution to “the child vs. the curriculum” controversy in its 
analysis of experience interpreted as an integrative process of an organ- 
ism and environment. ANN SHUMAKER. 

Graduate Student, Teachers College. 





THE PSYCHOLOGY OF PLAY 


The Psychology of Play Activities, by Harvey C. Lehman and Paul A. 
Witty. New York: A. S. Barnes and Co., 1927. Pp. XVIII + 
242. 


The educative significance of play, though generally recognized, 
has received relatively little objective and scientific verification. In 
order better to understand and control behavior, and with the intention 
of securing some quantitative bases for determining interests which 
might aid the vocational counselor, the authors have attempted to 
discover the play preferences of a number of individuals between the 
ages of 5 and 22. From this data they have sought to determine the 
effect on play of such variables as sex, age, race, season, intelligence, 
community, and so on. Play, in this study, was interpreted as those 
activities in which individuals spontaneously engaged in their leisure 
time. 

The plan employed was to place before about 6000 children of the 
vicinity of Kansas City a comprehensive list of 200 play activities. 
The items of this list were determined from previous studies with such 
additions as “competent persons” deemed advisable. These lists 
were to be checked to indicate (1) those in which the pupil had engaged 
of his own volition during the week preceding the investigation; (2) 
the three activities which he had liked best; (3) those to which he felt 
he had given the most time; (4) those in which he had participated 
alone. In order that seasonal differences might be taken into account, 
the check list was submitted to approximately the same children three 
times within the year. 


From the data received in the foregoing manner, tables were compiled, 
graphs drawn, and conclusions projected concerning the relation 
of play to general age growth; the play preferences of children below 
Grade III; sex differences; race, urban and rural, and seasonal differences 
in play preferences; the relation of play activity to school 
progress, to intelligence, and so on. 

It is obvious that this study is but a beginning and the findings 
may not be taken as final. The sampling of children was limited to a 
very local cultural and geographical area. The check list of activities 
was arbitrarily determined rather than being based sufficiently upon 
children’s own stated interest. The check list method must be supplemented 
by other techniques in order to be reliable in the best sense. 
Other variable factors such as guidance of play activities must also 
be taken into account. 

However, a general technique has been suggested, which, if elaborated 
and applied more widely, should eventuate in more information 
concerning the spontaneous play interests of children and their meaning 
for education. ANN SHUMAKER. 

Graduate Student, Teachers College. 





SUPPLEMENTARY READING 


The Supplementary Reading Assignment, by Carter V. Good. Balti- 
more: Warwick and York, Inc., 1927. Pp. XVI + 223. 


This book contains some very suggestive material bearing on the 
high school and college reading problem. The major concern is a com- 
parison of the results of extensive reading with intensive reading. 
The author has tested the effectiveness of practice in these two rather 
distinct types of materials in a multitude of ways—as to information 
gained, problem-solving ability, outlining, reproduction of ideas, and 
permanency of retention. The extensive reading method proved 
superior to the intensive reading method in all except the accuracy of 
information gained, in which case there were no significant differences 
between the two methods. The author also finds that the single rapid 
reading is “relatively effective” and probably preferable to two rapid 
readings or a single slow analytical reading. 

The results of the experiments in this book are suggestive rather 
than conclusive. The small number of cases used, the grade level of 
the students tested, the types of material practiced tend to limit the 
generalizations which may be drawn. The author has been very 
ingenious in the construction of his tests, and in the methods devised 
to motivate types of reading. 

The findings relative to the merits of extensive vs. intensive read- 
ing are widespread in their significance. Intensive reading, which is 
synonymous with textbook reading as commonly practiced, can only 
be justified on the basis of the actual informational value thus acquired. 
The textbook then, if to be read intensively, must contain only mate- 
rial that is of proved social worth. On the other hand, the encourage- 
ment of additional reading containing a wealth of illustrations should 
be urged. The gains in range of information, outlining, etc., as this 
study indicates, would justify such a procedure. 

In the role of a significant contribution to the problem of reading, 
this book should prove of value to every teacher or student in 
education. JAMES E. MENDENHALL. 

Lincoln School of Teachers College. 